HPC with the NVIDIA Accelerated Computing Toolkit
Mark Harris, November 16, 2015


Accelerators Surge in World's Top Supercomputers

- 100+ accelerated systems now on the Top500 list
- 1/3 of total FLOPS powered by accelerators
- NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
- Tesla supercomputers growing at a 50% CAGR over the past five years

[Chart: number of accelerated supercomputers on the Top500, 2013-2015]

70% OF TOP HPC APPS NOW ACCELERATED

Intersect360 survey of top apps:
- Top 10 HPC apps: 90% accelerated
- Top 50 HPC apps: 70% accelerated

A key materials app, which typically consumes 10-25% of an HPC system, is now accelerated: one dual-K80 server delivers 1.3x the performance of four dual-CPU servers (1.0x).

Source: Intersect360, "HPC Application Support for GPU Computing," Nov 2015. Benchmark: dual-socket Xeon E5-2690 v2 @ 3 GHz, dual Tesla K80, FDR InfiniBand; dataset: NiAl-MD, Blocked Davidson.

Accelerated Computing Available Everywhere

The NVIDIA Accelerated Computing Toolkit spans supercomputers, datacenter and cloud, workstations and notebooks, and embedded and automotive platforms.

NVIDIA Accelerated Computing Toolkit
Everything You Need to Accelerate Your Applications

- Libraries: FFT, BLAS, DNN
- Language solutions: C/C++, Fortran, OpenACC
- Nsight tools: Nsight IDE, Visual Profiler
- System interfaces: Unified Memory, GPUDirect
- Training: hands-on labs, docs & samples

developer.nvidia.com

Get Started Quickly

- Machine Learning Tools: design, train, and deploy deep neural networks (developer.nvidia.com/deep-learning)
- OpenACC Toolkit: a simple way to accelerate scientific code (developer.nvidia.com/openacc)
- CUDA Toolkit: high-performance C/C++ applications (developer.nvidia.com/cuda-toolkit)

Why is Machine Learning Hot Now?

Three forces converged: big data availability, new techniques, and GPU acceleration.
- 350 million images uploaded per day
- 2.5 petabytes of customer data collected hourly
- 300 hours of video uploaded every minute

cuDNN: Deep Learning Primitives
Igniting Artificial Intelligence

- Drop-in acceleration for major deep learning frameworks: Caffe, Theano, Torch
- Up to 2x faster on AlexNet than the baseline Caffe GPU implementation
- Software stack: applications on top of frameworks, on top of cuDNN, on top of NVIDIA GPUs

[Charts: millions of images trained per day on cuDNN 1, 2, and 3; cuDNN 3 trains up to 2x faster than cuDNN 2 on AlexNet, OverFeat, and VGG]

Tesla M40
World's Fastest Accelerator for Deep Learning

- 8x faster Caffe performance than a CPU server
- Reduces training time from 10 days to 1 day

Tesla M40 specifications:
  CUDA cores:    3072
  Peak SP:       7 TFLOPS
  GDDR5 memory:  12 GB
  Bandwidth:     288 GB/s
  Power:         250 W

Caffe benchmark: AlexNet training throughput based on 20 iterations. CPU: E5-2680v3 @ 2.70 GHz, 64 GB system memory, CentOS 6.2.

NVIDIA DIGITS
Interactive Deep Learning GPU Training System

For data scientists and researchers:
- Quickly design the best deep neural network (DNN) for your data
- Visually monitor DNN training quality in real time
- Manage training of many DNNs in parallel on multi-GPU systems

developer.nvidia.com/digits

OpenACC: Simple, Powerful, Portable
Fueling the Next Wave of Scientific Discoveries in HPC

main() {
  <serial code>
  #pragma acc kernels  // automatically run on GPU
  {
    <parallel code>
  }
}

- Simple: University of Illinois PowerGrid MRI reconstruction, 70x speedup with 2 days of effort
- Powerful: RIKEN Japan NICAM climate modeling, 7-8x speedup with only 5% of the code modified
- Portable: performance portability with PGI

[Chart: speedup vs. a single CPU thread on miniGhost (Mantevo), NEMO (climate & ocean), and CloverLeaf (physics), comparing CPU: MPI + OpenMP, CPU: MPI + OpenACC, and CPU + GPU: MPI + OpenACC]

http://www.cray.com/sites/default/files/resources/openacc_213462.12_openacc_cosmo_cs_fnl.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/s5297-hisashi-yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7

Benchmark configurations: 359.miniGhost CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80 (single GPU). NEMO: each socket Intel Xeon E5-2698 v3, 16 cores; GPU: NVIDIA K80 (both GPUs). CloverLeaf: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80 (both GPUs).
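To make the kernels directive above concrete, here is a minimal, self-contained SAXPY sketch in the same spirit; the function, data, and sizes are illustrative assumptions rather than anything from the slide, and with the PGI compiler it would typically be built with something like pgcc -acc -Minfo=accel:

#include <stdio.h>
#include <stdlib.h>

/* The kernels directive asks the compiler to generate GPU code
   for the loop automatically; built without -acc it stays a CPU loop. */
void saxpy(int n, float a, float *x, float *y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));

    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(n, 3.0f, x, y);

    printf("y[0] = %f\n", y[0]);  /* expect 5.0 */
    free(x);
    free(y);
    return 0;
}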

LS-DALTON
Large-scale Application for Calculating High-accuracy Molecular Energies

Minimal effort:
- Lines of code modified: <100
- Time required: 1 week
- Codes to maintain: 1 source

Big performance: the LS-DALTON CCSD(T) module benchmarked on the Titan supercomputer (AMD CPU vs. Tesla K20X).

[Chart: speedup vs. CPU for Alanine-1 (13 atoms), Alanine-2 (23 atoms), and Alanine-3 (33 atoms)]

"OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation."
Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University

The NVIDIA OpenACC Toolkit
A Free Toolkit Offering a Simple and Powerful Path to Accelerated Computing

- PGI compiler: free OpenACC compiler for academia
- NVProf profiler: easily find where to add compiler directives
- Code samples: learn from examples of real-world algorithms
- Documentation: quick start guide, best practices, forums

Download at developer.nvidia.com/openacc

CUDA Toolkit
Flexibility and High Performance

- Libraries: drop-in acceleration
- Parallel C++: parallel algorithms and custom parallel C++, such as the __global__ xyzw_frequency kernel shown on the C++11 slide below
- Developer tools: pinpoint bugs and performance opportunities

Download at developer.nvidia.com/cuda-toolkit

GPU Accelerated Libraries
Drop-in Acceleration for Your Applications

- Linear algebra & numerical: cuBLAS, cuSPARSE, cuSOLVER, AmgX, cuRAND, NVIDIA Math Lib
- Parallel algorithms
- Machine learning and GPU AI (path finding)
- Signal processing, FFT, image & video: cuFFT, NVIDIA NPP, NVENC video encode

developer.nvidia.com/gpu-accelerated-libraries
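As a hedged sketch of what this drop-in style looks like in practice (the sizes and values are illustrative assumptions, and Unified Memory is used only to keep the example short), a SAXPY through cuBLAS:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 1 << 20;
    const float a = 2.0f;
    float *x, *y;

    /* Managed allocations are visible to both CPU and GPU */
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &a, x, 1, y, 1);  /* y = a*x + y on the GPU */
    cudaDeviceSynchronize();                 /* wait before reading on the CPU */

    printf("y[0] = %f\n", y[0]);             /* expect 4.0 */

    cublasDestroy(handle);
    cudaFree(x);
    cudaFree(y);
    return 0;
}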

Developer Tools
Pinpoint Performance Bottlenecks and Hard-to-Find Bugs

- NVIDIA Visual Profiler: analyze GPU performance
- Nsight IDE: develop, debug, and profile in Eclipse or Visual Studio
- CUDA-GDB: standard Linux debugger
- CUDA-MEMCHECK: find memory errors, leaks, and race conditions

Your Next Steps: Many Learning Opportunities

- Free hands-on self-paced labs in the NVIDIA booth; sign up for credit for the online labs
- Free in-depth online courses:
  Deep Learning: developer.nvidia.com/deep-learning-courses
  OpenACC: developer.nvidia.com/openacc-course
- Learn from hundreds of experts at GTC 2016, April 4-7, 2016, Silicon Valley
- All about accelerated computing: developer.nvidia.com

Backup

Thrust: Productive Parallel C++

- High-level parallel algorithms; resembles the C++ STL
- Included in the CUDA Toolkit
- Productive: high-level API
- CPU/GPU performance portability: CUDA, OpenMP, and TBB backends
- Flexible and extensible

// generate 32M random numbers on host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to device (GPU)
thrust::device_vector<int> d_vec = h_vec;

// sort data on device
thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

C++11 in CUDA

"C++11 feels like a new language." (Bjarne Stroustrup)

Count the number of occurrences of the letters x, y, z, and w in a text:

__global__ void xyzw_frequency(int *count, char *text, int n)
{
  const char letters[] { 'x','y','z','w' };   // initializer list

  count_if(count, text, n, [&](char c) {      // lambda function
    for (const auto x : letters)              // range-based for loop,
      if (c == x) return true;                // automatic type deduction
    return false;
  });
}

Output:
Read 3288846 bytes from "warandpeace.txt"
counted 107310 instances of 'x', 'y', 'z', or 'w' in "warandpeace.txt"
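The count_if device function called above is not shown on the slide; a plausible implementation, assumed here as a grid-stride loop over the input with an atomic counter, might look like this:

/* Each thread tests a strided subset of the input and atomically
   increments the shared counter whenever the predicate matches. */
template <typename T, typename Predicate>
__device__ void count_if(int *count, T *data, int n, Predicate p)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
    {
        if (p(data[i])) atomicAdd(count, 1);
    }
}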

Parallel STL: Algorithms + Execution Policies
Higher-level abstractions for ease of use and performance portability

- On track for the C++17 standard
- Includes algorithms such as for_each, reduce, transform, inclusive_scan
- Execution policies: std::seq, std::par, std::par_vec

int main()
{
  size_t n = 1 << 16;
  std::vector<float> x(n, 1), y(n, 2), z(n);
  float a = 13;

  auto r = interval(0, n);
  std::for_each(std::par, std::begin(r), std::end(r),
                [&](int i) { z[i] = a * x[i] + y[i]; });

  return 0;
}
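The interval used above is not a standard type; to make the example self-contained, a minimal counting-range sketch (purely an assumption for illustration) could be:

#include <cstddef>
#include <iterator>

/* A trivial counting range: iterating yields lo, lo+1, ..., hi-1. */
struct interval {
    struct iterator {
        using value_type        = int;
        using difference_type   = std::ptrdiff_t;
        using pointer           = const int *;
        using reference         = int;
        using iterator_category = std::forward_iterator_tag;

        int v;
        int operator*() const { return v; }
        iterator &operator++() { ++v; return *this; }
        bool operator==(const iterator &o) const { return v == o.v; }
        bool operator!=(const iterator &o) const { return v != o.v; }
    };

    int lo, hi;
    interval(int lo, int hi) : lo(lo), hi(hi) {}
    iterator begin() const { return {lo}; }
    iterator end() const { return {hi}; }
};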

Instruction-Level Profiling: Pinpoint Performance Problems
New in the CUDA 7.5 Visual Profiler

- Highlights hot spots in code, helping you target bottlenecks
- Quickly locate expensive code lines
- Detailed per-line stall analysis helps you understand the underlying limiters

Unified Memory: Dramatically Lower Developer Effort

In the past, the developer's view included two distinct memories, system memory and GPU memory, with data moved explicitly between them. With Unified Memory, the developer sees a single unified memory space.

Simplified Memory Management Code

CPU code:

void sortfile(FILE *fp, int N)
{
  char *data;
  data = (char *)malloc(N);

  fread(data, 1, N, fp);

  qsort(data, N, 1, compare);

  use_data(data);

  free(data);
}

CUDA 6 code with Unified Memory:

void sortfile(FILE *fp, int N)
{
  char *data;
  cudaMallocManaged(&data, N);

  fread(data, 1, N, fp);

  qsort<<<...>>>(data, N, 1, compare);
  cudaDeviceSynchronize();

  use_data(data);

  cudaFree(data);
}
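For a complete, compilable illustration of the same idea (the kernel, sizes, and values are illustrative assumptions, not from the slide):

#include <cstdio>

/* Increment each element in place. */
__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    int *data;

    cudaMallocManaged(&data, n * sizeof(int));  /* one allocation, visible to CPU and GPU */
    for (int i = 0; i < n; ++i) data[i] = i;

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();  /* required before the CPU touches the data again */

    printf("data[0] = %d, data[n-1] = %d\n", data[0], data[n - 1]);  /* 1 and n */

    cudaFree(data);
    return 0;
}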

Great Performance with Unified Memory
RAJA: Portable Parallel C++ Framework

- RAJA uses Unified Memory for heterogeneous array allocations
- Parallel forall loops run on the CPU or GPU

"Excellent performance considering this is a generic version of LULESH with no architecture-specific tuning." (Jeff Keasler, LLNL)

[Chart: LULESH throughput in millions of elements per second; GPU speedup vs. CPU by mesh size: 1.5x at 45^3, 1.9x at 100^3, 2.0x at 150^3]

GPU: NVIDIA Tesla K40; CPU: Intel Haswell E5-2650 v3 @ 2.30 GHz, single socket, 10 cores.
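A hedged sketch of the forall pattern described above, following RAJA's public API (the DAXPY body and block size are illustrative assumptions; swapping the execution policy retargets the same loop to the CPU):

#include "RAJA/RAJA.hpp"

void daxpy(int n, double a, double *x, double *y)
{
    /* With Unified Memory allocations, x and y need no explicit
       transfers; the policy template parameter selects the backend. */
    RAJA::forall<RAJA::cuda_exec<256>>(
        RAJA::RangeSegment(0, n),
        [=] RAJA_DEVICE (int i) {
            y[i] = a * x[i] + y[i];
        });
}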