NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS

Size: px

Start display at page:

Download "NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS"

Audrey Richardson
6 years ago
Views:

1 NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS

2 1 Unified Memory CUDA XT and Drop-in Libraries GPUDirect RDMA in MPI 4 Developer Tools

3 1 Unified Memory CUDA XT and Drop-in Libraries GPUDirect RDMA in MPI 4 Developer Tools

4 Unified Memory Dramatically Lower Developer Effort Developer View Today Developer View With Unified Memory System Memory GPU Memory Unified Memory

Super Simplified Memory Management Code void sortfile(file

cudamallocmanaged(&, N); fread(, 1, N, fp); qsort<<<.

5 Super Simplified Memory Management Code void sortfile(file *fp, int N) { char *; = (char *)malloc(n); fread(, 1, N, fp); qsort(, N, 1, compare); use_(); CPU Code CUDA 6 Code with Unified Memory void sortfile(file *fp, int N) { char *; cudamallocmanaged(&, N); fread(, 1, N, fp); qsort<<<...>>>(,n,1,compare); cudadevicesynchronize(); use_(); } free(); } cudafree();

6 Unified Memory Delivers 1. Simpler Programming & Memory Model Single pointer to, accessible anywhere Tight language integration Greatly simplifies code porting 2. Performance Through Data Locality Migrate to accessing processor Guarantee global coherency Still allows cudamemcpyasync() hand tuning

7 Simpler Memory Model: Eliminate Deep Copies CPU Memory struct Elem { int prop1; int prop2; char *text; }; Elem prop1 prop2 *text Hello World GPU Memory

8 Simpler Memory Model: Eliminate Deep Copies CPU Memory struct Elem { int prop1; int prop2; char *text; }; Elem prop1 prop2 *text Elem prop1 prop2 *text GPU Memory Hello World Two Copies Required Hello World

9 Simpler Memory Model: Eliminate Deep Copies void launch(elem *elem) { Elem *g_elem; char *g_text; int textlen = strlen(elem->text); // Allocate storage for struct and text cudamalloc(&g_elem, sizeof(elem)); cudamalloc(&g_text, textlen); CPU Memory Elem prop1 prop2 Hello World *text } // Copy up each piece separately, including // new text pointer value cudamemcpy(g_elem, elem, sizeof(elem)); cudamemcpy(g_text, elem->text, textlen); cudamemcpy(&(g_elem->text), &g_text, sizeof(g_text)); // Finally we can launch our kernel, but // CPU & GPU use different copies of elem kernel<<<... >>>(g_elem); Elem prop1 prop2 *text Two Copies Required GPU Memory Hello World

10 Simpler Memory Model: Eliminate Deep Copies CPU Memory void launch(elem *elem) { kernel<<<... >>>(elem); } Unified Memory Elem prop1 prop2 *text Hello World GPU Memory

11 Simpler Memory Model Example: GPU & CPU Shared Linked Lists CPU Memory key key key key next next next next GPU Memory

12 Simpler Memory Model Example: GPU & CPU Shared Linked Lists CPU Memory Only practical option is to use zero-copy (pinned system) memory GPU accesses at PCIe bandwidth GPU accesses at very high latency key next key next All access over PCIe key next key next GPU Memory

13 Simpler Memory Model Example: GPU & CPU Shared Linked Lists CPU Memory Can pass list elements between Host & Device Can insert and delete elements from Host or Device* key Unified Memory key Local access key key Single list - no complex synchronization next next next next *Program must still ensure no race conditions. *Data is coherent between CPU & GPU at kernel launch & sync only Local access GPU Memory

$pass-by-reference: JUST WORKS class Managed { void *operator new(size_t len) { void *ptr; cudamallocmanaged(&ptr, len); return ptr; } CPU Program Elem * = new Elem; Unified Memory }; void operator$

14 Unified Memory with C++ A Powerful Combination C++ objects migrate easily when allocated on managed heap Overload new operator to use C++ in unified memory region Deep copies, pass-by-value, pass-by-reference: JUST WORKS class Managed { void *operator new(size_t len) { void *ptr; cudamallocmanaged(&ptr, len); return ptr; } CPU Program Elem * = new Elem; Unified Memory }; void operator delete(void *ptr) { cudafree(ptr); } // Inherit from Managed, // C++ now handles our deep copies class Elem : public Managed { int prop1; int prop2; String text; }; Elem prop1 prop2 text Hello World GPU Program

Unified Memory Roadmap CUDA 6: Ease of Use Next:

Memcopy Required Coherence @ launch & sync

Migration Hints Additional OS Support System

15 Unified Memory Roadmap CUDA 6: Ease of Use Next: Optimizations Single Pointer to Data Maxwell No Memcopy Required launch & sync Shared C/C++ Data Structures Prefetching Migration Hints Additional OS Support System Allocator Unified Stack Memory Unified HW-Accelerated Coherence

16 1 Unified Memory CUDA XT and Drop-in Libraries GPUDirect RDMA in MPI 4 Developer Tools

17 Extended (XT) Library Interfaces Automatic Scaling to >1 GPU per node cufft and cublas level 3 Out-of-core operations: e.g. very large GEMM BLAS 3 Host Interfaces: automatically overlaps memory transfers

18 Execution Time (ms) Multi-GPU cufft Single & Batch Transforms across multiple GPUs (max 2 in CUDA 6) Tuned for multi-gpu cards (K10) Better scaling for larger transforms cufft 3D Performance on 2 GPUs* K10: Single GPU K10: Dual GPU x256x x512x512 *Does not include memcpy time

19 GFLOPS Multi-GPU cublas Single function call automatically spreads work across two GPUs Source and result in system memory Supports matrices > size of memory (out-of-core) All BLAS Level-3 routines cublas ZGEMM Performance on 2 GPUS in-core limit Matrix Size (NxN) 1 K20c 2 K20c

20 New Drop-in NVBLAS Library Drop-in replacement for CPU-only BLAS Automatically routes standard BLAS3 calls to cublas Optionally configure which routines and matrix sizes are accelerated User provides CPU-only BLAS dynamic library location Simply re-link or change library load order gcc myapp.c lnvblas -lmkl_rt -o myapp - or - env LD_PRELOAD=libnvblas.so myapp

21 fp64 GFlops/s New Drop-in NVBLAS Library Drop-in replacement for CPU-only BLAS Automatically route BLAS3 calls to cublas 3000 Matrix-Matrix Multiplication in R Example: Drop-in Speedup for R 2500 > LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so R > A <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096) > B <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096) > system.time(c <- A %*% B) user system elapsed nvblas, 4x K20X GPUs MKL, 6-core Xeon E CPU Use in any app that uses standard BLAS3 Octave, Scilab, etc matrix dimension

22 1 Unified Memory CUDA XT and Drop-in Libraries GPUDirect RDMA in MPI 4 Developer Tools

with GPU-Direct-RDMA Execution Time of HSG Application with 2 GPU Nodes 90 80 70 60 4 MPI Processes / GPU Node MVAPICH2-1.9 MVAPICH2-1.

23 Total Time (S.) Better GPUDirect RDMA in MVAPICH2 & OpenMPI Reduced inter-node latency Better MPI Application Scaling Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA Execution Time of HSG Application with 2 GPU Nodes MPI Processes / GPU Node MVAPICH2-1.9 MVAPICH2-1.9-GDR 36% % % 63% 10 Visit Mellanox booth #2722 for a Demo Side Number Based on MVAPICH2-1.9 Intel Sandy Bridge (E5-2670) node with 16 cores NVIDIA Telsa K20c GPU, Mellanox ConnectX-3 FDR HCA CUDA 5.0, OFED with GPU-Direct-RDMA Patch

24 1 Unified Memory CUDA XT and Drop-in Libraries GPUDirect RDMA in MPI 4 Developer Tools

25 Remote Development with Nsight Eclipse Edition Local IDE, remote application Edit locally, build & run remotely Automatic sync via ssh Cross-compilation to ARM Full debugging & profiling via remote connection Edit sync Build Run Debug Profile

26 CUDA tools for MPS (Multi-Process Server) Profile MPI apps on MPS using nvprof Import multi-process MPI ranks into Visual Profiler Run CUDA- MEMCHECK on apps running on MPS

27 Detailed Kernel Profiling Visual Profiler and NSight EE Instruction counts automatically locate hot spots in your code Corresponding Assembly Detected Hot Spot

28 Detailed Instruction Mix Visualization Visual Profiler and NSight EE

29 CUDA 6 Dramatically Simplifies Parallel Programming with Unified Memory Sign up for CUDA Registered Developer Program

Unified memory. GPGPU 2015: High Performance Computing with CUDA University of Cape Town (South Africa), April, 20th-24th, 2015

Unified memory. GPGPU 2015: High Performance Computing with CUDA University of Cape Town (South Africa), April, 20th-24th, 2015 Unified memory GPGPU 2015: High Performance Computing with CUDA University of Cape Town (South Africa), April, 20th-24th, 2015 Manuel Ujaldón Associate Professor @ Univ. of Malaga (Spain) Conjoint Senior