NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU


1 NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

2 GPGPU opens the door for co-designed HPC and, moreover, for middleware-supported embedded system designs that harness the power of GPU-accelerated computing.

3 Modern HPC Node Architecture [Diagram: an x86 CPU and a GPU accelerator connected by PCIe 3. Labels: Latency-optimized CPUs, Shared Cache, High Capacity Memory; Throughput-optimized Accelerators, Shared Cache, High Bandwidth Memory; Exposed memory hierarchies.]

4 The Changing HPC System Landscape: HPC Compilers Must Evolve to Support Heterogeneous Systems. Top 500 in 2008: 90% x86-64 CPUs. [Diagram labels: Intel Xeon ARM64 + Tesla POWER, Intel Knights Landing, AMD APU, NVIDIA Pascal]

5 Modern CV Pipeline

6 OpenVX Framework Enables Mobile/Embedded CV
- Graph Scheduling: split the graph execution across the whole system (CPU / GPU / dedicated HW). Benefit: faster execution or lower power consumption.
- Memory management: reuse memory for different intermediate data. Benefit: less allocation overhead, more memory for other applications.
- Kernel Merge: replace a subgraph by a single faster node. Benefit: better memory locality, less kernel launch overhead.
- Data Tiling: execute a subgraph at tile granularity instead of image granularity. Benefit: better use of data cache and local memory.
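
The kernel-merge idea on this slide can be illustrated with a small host-side sketch. It does not use the OpenVX API; the two per-pixel stages `scale` and `offset` are hypothetical stand-ins for graph nodes, chosen only to show how merging removes the intermediate image and the second pass over memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-pixel stages standing in for two graph nodes
// (assumptions for illustration, not OpenVX kernels).
inline float scale(float p)  { return p * 0.5f; }
inline float offset(float p) { return p + 1.0f; }

// Unmerged graph: two nodes with a full-size intermediate image between them.
std::vector<float> run_two_nodes(const std::vector<float>& in) {
    std::vector<float> tmp(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = scale(in[i]);
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = offset(tmp[i]);
    return out;
}

// Merged graph: one node, one pass over the data, no intermediate buffer.
std::vector<float> run_merged_node(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = offset(scale(in[i]));
    return out;
}
```

Both functions produce the same image; the merged version touches each pixel once, which is the locality and launch-overhead benefit the slide lists.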

7 Architecture Support for CV and ML: Subsampled SLIC (S-SLIC), Mapping DNN Computations, Patch Memory System (PMEM)

8 Accelerated computing toolkit

9 Programming Heterogeneous Systems
Desire: homogeneous program, heterogeneous execution
- Want programming systems designed for heterogeneity
- Want unified programming models that include parallelism
- Want to design scalable programs for parallelism
[Diagram: Today, a hybrid program with a serial part and a parallel part split between CPU and GPU. Ideal, a homogeneous programming model with a single parallel + serial program for CPU and GPU.]
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

10 Unified Memory: Dramatically Lower Developer Effort [Diagram: Past Developer View, with separate System Memory and GPU Memory; Developer View With Unified Memory, with a single Unified Memory.]

11 Meta-Data on CUDA Allocators
Each memory pool has its own meta-data. A free operation on the GPU for memory allocated on the CPU would require modifying the CPU meta-data, which involves:
- Inter-processor synchronization (e.g., locks)
- Remote memory accesses
[Diagram labels: GPU Memory, GPU Driver, Application, GPU Allocator, CPU, GPU]

12 Great Performance with Unified Memory
RAJA: Portable C++ Framework for parallel-for style programming
- RAJA uses Unified Memory for heterogeneous array allocations
- Parallel forall loops run on device
"Excellent performance considering this is a 'generic' version of LULESH with no architecture-specific tuning." - Jeff Keasler, LLNL
[Chart: LULESH Throughput in million elements per second versus mesh size ( ^3, 100^3, 150^3); CPU: 10-core Haswell versus GPU: Tesla K40; speedup labels 1.9x, 2.0x, x. GPU: NVIDIA Tesla K40, CPU: Intel Haswell E GHz, single socket 10-core]
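
To make "parallel-for style programming" concrete, here is a minimal sketch of a `forall` template that takes an execution-policy tag and a loop body. The names `seq_exec`, `omp_exec`, `forall`, and `scaled_copy` are hypothetical and chosen for this illustration; this is not RAJA's actual interface, and it runs on the host only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical policy tags for this sketch (assumptions, not RAJA types).
struct seq_exec {};
struct omp_exec {};

// Sequential back end: a plain loop over [begin, end).
template <typename Body>
void forall(seq_exec, std::size_t begin, std::size_t end, Body body) {
    for (std::size_t i = begin; i < end; ++i) body(i);
}

// OpenMP back end: same loop body, different execution.
// The pragma is ignored when OpenMP is disabled, leaving a serial loop.
template <typename Body>
void forall(omp_exec, std::size_t begin, std::size_t end, Body body) {
    const long long lo = static_cast<long long>(begin);
    const long long hi = static_cast<long long>(end);
    #pragma omp parallel for
    for (long long i = lo; i < hi; ++i) body(static_cast<std::size_t>(i));
}

// The application loop is written once; only the policy type changes.
template <typename Policy>
std::vector<double> scaled_copy(const std::vector<double>& x, double a) {
    std::vector<double> y(x.size());
    forall(Policy{}, 0, x.size(), [&](std::size_t i) { y[i] = a * x[i]; });
    return y;
}
```

Calling `scaled_copy<seq_exec>(x, 2.0)` and `scaled_copy<omp_exec>(x, 2.0)` gives the same result; the policy only selects how the loop executes.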

13 OpenACC + Unified Memory: Productivity and Performance on Kepler [Chart: GPU Acceleration of LULESH with OpenACC, in zones / second. Bars: CPU OpenMP 2.0; GPU OpenACC + Unified Memory (Days of Work); GPU OpenACC + Data Directives (Weeks of Work).]

14 OpenMP vs OpenACC? Performance Portability = Maintainability
OpenMP 4.x:
#if defined(cpu)
#pragma omp parallel for simd
#elif defined(mic)
#pragma omp target teams distribute \
    parallel for simd
#elif defined(omp_gpu)
#pragma omp target teams distribute \
    parallel for schedule(static,1)
#elif defined(something_else)
#pragma omp target...
#endif
for(int i = 0; i < N; i++)
OpenACC:
#pragma acc kernels loop
for(int i = 0; i < N; i++)
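
As a compilable companion to the directive fragments on this slide, the sketch below attaches either an OpenACC or an OpenMP directive to one SAXPY loop. The macro `USE_OPENACC` and the function name `saxpy` are assumptions made for this example. A compiler without OpenMP or OpenACC support ignores the pragmas and runs the loop serially, so the result is the same either way.

```cpp
#include <cstddef>
#include <vector>

// y[i] = a * x[i] + y[i]; assumes y.size() >= x.size().
// USE_OPENACC is a hypothetical build flag for this sketch.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const float* xp = x.data();
    float* yp = y.data();
    const int n = static_cast<int>(x.size());
#if defined(USE_OPENACC)
    // Descriptive style: the compiler decides how to parallelize the loop.
    #pragma acc kernels loop copyin(xp[0:n]) copy(yp[0:n])
#else
    // Prescriptive style: threads plus SIMD on the host CPU.
    #pragma omp parallel for simd
#endif
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The loop body is identical in both branches; only the directive line differs, which is the maintainability point the slide makes.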

15 OpenACC or OpenMP?
What does OpenACC have that OpenMP does not?
- True parallel loops
- Dynamic scalability
- Descriptive parallelism
- KERNELS directive: semi-automatic parallelization
- CACHE directive: important for exploiting SW-managed caches
What does OpenMP have that OpenACC does not?
- SINGLE, SECTIONS
- MASTER, CRITICAL, BARRIER, ORDERED
- TASKs, CANCEL
- locks

17 Parallel STL: Algorithms + Policies
Higher-level abstractions for ease of use and performance portability
int main() {
  size_t n = 1 << 16;
  std::vector<float> x(n, 1), y(n, 2), z(n);
  float a = 13;
  auto I = interval(0, n);
  std::for_each(std::par, std::begin(I), std::end(I),
                [&](int i) { z[i] = a * x[i] + y[i]; });
  return 0;
}
- On track for C++17 Standard
- Includes algorithms such as for_each, reduce, transform, inclusive_scan
- Execution policies: std::seq, std::par, std::par_vec
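
The slide's snippet relies on an `interval` helper and on the pre-standard policy names, neither of which is in the C++ standard library. The sketch below is one way to write the same loop with standard components only, under two assumptions: an explicit index vector filled by `std::iota` replaces `interval(0, n)`, and the sequential overload of `std::for_each` is used so that no parallel runtime is needed. The function name `saxpy_for_each` is hypothetical. In published C++17, the policy objects are `std::execution::seq`, `std::execution::par`, and `std::execution::par_unseq` from `<execution>`, passed as the first argument.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Computes z[i] = a * x[i] + y[i]; assumes x and y have the same size.
std::vector<float> saxpy_for_each(float a,
                                  const std::vector<float>& x,
                                  const std::vector<float>& y) {
    std::vector<float> z(x.size());

    // Stand-in for the slide's interval(0, n): indices 0, 1, ..., n-1.
    std::vector<std::size_t> idx(x.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});

    // Sequential overload. A C++17 build with a parallel back end could
    // pass an execution policy as the first argument instead.
    std::for_each(idx.begin(), idx.end(),
                  [&](std::size_t i) { z[i] = a * x[i] + y[i]; });
    return z;
}
```

With the slide's values (`a = 13`, `x` filled with 1, `y` filled with 2), every element of the result is 15.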

18 Other accelerated computing toolkits with CUDA

19 MAPS
MAPS is an open-source, header-only C++ CUDA template library for automatic multi-GPU programming and optimization of GPU kernels. The framework leverages memory access patterns to provide near-optimal performance on various architectures. Using MAPS:
- Automatically produces optimized GPU code
- Separates complex indexing and shared memory optimizations from the algorithm
- Transparently manages multi-GPU memory segmentation and inter-GPU communication
- Provides familiar STL-based interfaces (containers and iterators)
- Results in short, intelligible code

20 Legion
Legion is a data-centric parallel programming system for writing portable high performance programs targeted at distributed heterogeneous architectures.

21 GPUCC: Google's Open-Source CUDA Compiler
gpucc is an open-source compiler built by Google that targets CUDA and NVIDIA GPUs. It performs various general and CUDA-specific optimizations to generate high performance code. It outperforms NVIDIA's toolchain (nvcc) on internal large-scale end-to-end benchmarks by up to 51%, and is on par for several open-source benchmarks (Rodinia, SHOC and Tensor). It supports modern language features such as those in C++11 and C++14, and compiles code 8% faster than nvcc, up to 2.4x faster for pathological compiles.
