Future Directions for CUDA Presented by Robert Strzodka

1 Future Directions for CUDA
Presented by Robert Strzodka
Authored by Mark Harris, NVIDIA Corporation

2 Platform for Parallel Computing
The CUDA Platform is a foundation that supports a diverse parallel computing ecosystem.

3 Platform for Parallel Computing
CUDA releases shown across the slide: 1.0, 2.0, 3.0, 4.0, 5.0
Platform: GPUDirect, UVA, GPU-Aware MPI, Dynamic Parallelism
Compiler Tool Chain: NVCC, LLVM, Device Code Linking
Programming Languages: C, C++, Templates, Inheritance, Recursion, Function pointers, Fortran (PGI), C++ new/delete, Virtual functions, OpenACC
Libraries: cublas, cufft, NVPP, cusparse, curand, Thrust, new NVPP functions, cublas Device API
Developer Tools: Command-Line Profiler, cuda-gdb, Visual Profiler, cuda-memcheck, nvidia-smi, Nsight IDE, New Visual Profiler, Nsight Eclipse Ed., Detect Shared Memory Hazards

4 Investing in the Future
Programming Model
Enabling More Programmers
Future Computing Platforms

5 Unified Programming Language

    serial_sort(sorted, data, N);
    parallel_sort <<<... >>> (sorted, data, N);

6 Unified Run-Time Interface: CUDA Dynamic Parallelism

    int main() {
        float *data;
        setup(data);

        A <<<... >>> (data);
        B <<<... >>> (data);
        C <<<... >>> (data);

        cudaDeviceSynchronize();
        return 0;
    }

    __global__ void B(float *data) {
        do_stuff(data);

        X <<<... >>> (data);
        Y <<<... >>> (data);
        Z <<<... >>> (data);
        cudaDeviceSynchronize();

        do_more_stuff(data);
    }

[Diagram: main runs on the CPU and launches kernels A, B and C on the GPU; kernel B in turn launches X, Y and Z on the GPU.]
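The launch pattern on this slide can be mimicked on the CPU to make the control flow concrete. The following is only an analogy, assuming a thread pool as a stand-in for the GPU; `run_demo`, `leaf` and `kernel_b` are hypothetical names, and the sketch does not model the ordering between A, B and C or any real CUDA behaviour. It shows the one property the slide emphasises: work launched from inside B finishes before B continues past its synchronize call.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_demo():
    """Run A, B, C on a pool; B launches X, Y, Z itself and waits for them."""
    log = []  # completion log; list.append is thread-safe in CPython
    pool = ThreadPoolExecutor(max_workers=8)  # stand-in for the GPU

    def leaf(name):
        # A "kernel" with no children: just record that it ran.
        log.append(name)

    def kernel_b():
        log.append("B:do_stuff")
        # Nested launch: B submits more work to the same pool.
        children = [pool.submit(leaf, n) for n in ("X", "Y", "Z")]
        wait(children)  # plays the role of the synchronize inside B
        log.append("B:do_more_stuff")

    launches = [
        pool.submit(leaf, "A"),
        pool.submit(kernel_b),
        pool.submit(leaf, "C"),
    ]
    wait(launches)  # plays the role of the synchronize in main
    pool.shutdown()
    return log


if __name__ == "__main__":
    print(run_demo())
```

The pool is sized so that B can block on its children without starving them; a pool with too few workers would deadlock in this sketch.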

7

    void sortfile(FILE *fp, int N) {
        char *data = (char*)malloc(N);
        char *sorted = (char*)malloc(N);
        fread(data, 1, N, fp);

        char *d_data, *d_sorted;
        cudaMalloc(&d_data, N);
        cudaMalloc(&d_sorted, N);
        cudaMemcpy(d_data, data, N, ...);

        parallel_sort<<<... >>>(d_sorted, d_data, N);

        cudaMemcpy(sorted, d_sorted, N, ...);
        cudaFree(d_data);
        cudaFree(d_sorted);

        use_data(sorted);
        free(data);
        free(sorted);
    }

8 Unified Virtual Memory

    void sortfile(FILE *fp, int N) {
        char *data = (char*)malloc(N);
        char *sorted = (char*)malloc(N);
        fread(data, 1, N, fp);

        parallel_sort<<<... >>>(sorted, data, N);

        use_data(sorted);
        free(data);
        free(sorted);
    }

9 Simpler, More Integrated Programming
[Chart: DP GFLOPS per Watt (axis ticks 2, 4, 6, 8) by GPU generation starting in 2008: Tesla, Fermi, Kepler, Maxwell. Annotations: Unified Language, Unified Run-Time, Unified Virtual Memory.]

10 Diversity of Programming Languages

11 Enabling More Programming Languages
Developers want to build front-ends for Python, Java, R, DSLs.
Target other processors like ARM, FPGAs, GPUs, x86.
[Diagram: CUDA C, C++, Fortran and New Language Support feed the LLVM Compiler for CUDA, which targets NVIDIA GPUs, x86 CPUs and New Processor Support.]

12 Enabling More Programming Languages
[Diagram: CUDA C, C++, Fortran and New Language Support feed the LLVM Compiler for CUDA, which targets NVIDIA GPUs, x86 CPUs and New Processor Support.]
Mozilla Rust
Halide (

13
Rapid Development
Powerful Libraries
Large Community
Commercial Support

14 Is Python Fast Enough for HPC?
Python apps often implement performance-critical functions in C/C++.
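A small sketch of why that split is common, assuming CPython, where `sum`, `map` and `operator.mul` run their loops in the interpreter's C code. The function names are made up for illustration, and the timings printed will vary by machine, so none are quoted here.

```python
import timeit
from operator import mul


def py_sum_squares(values):
    """Sum of squares with the loop written in Python bytecode."""
    total = 0
    for v in values:
        total += v * v
    return total


def c_backed_sum_squares(values):
    """Same result, but the iteration happens inside C-implemented built-ins."""
    return sum(map(mul, values, values))


if __name__ == "__main__":
    data = list(range(100_000))
    # Both versions must agree before the timings mean anything.
    assert py_sum_squares(data) == c_backed_sum_squares(data)
    for fn in (py_sum_squares, c_backed_sum_squares):
        seconds = timeit.timeit(lambda: fn(data), number=20)
        print(f"{fn.__name__}: {seconds:.3f}s for 20 runs")
```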

15 Compile Python for Parallel Architectures
Anaconda Accelerate from Continuum Analytics
NumbaPro array-oriented compiler for Python & NumPy
Compile for CPUs or GPUs (uses LLVM + NVIDIA Compiler SDK)
Fast Development + Fast Execution: Ideal Combination
Free Academic License

16 CUDA Programming, Python Syntax

    @cuda.jit(..., argtypes=[f8, f8, uint32], device=True)
    def mandel(x, y, max_iters):
        zr, zi = 0.0, 0.0
        for i in range(max_iters):
            newzr = (zr*zr-zi*zi)+x
            zi = 2*zr*zi+y
            zr = newzr
            if (zr*zr+zi*zi) >= 4:
                return i
        return 255

    @cuda.jit(argtypes=[..., f8, f8, f8, f8, uint32])
    def mandel_kernel(img, min_x, max_x, min_y, max_y, iters):
        x, y = cuda.grid(2)
        if x < img.shape[0] and y < img.shape[1]:
            img[y, x] = mandel(min_x+x*((max_x-min_x)/img.shape[0]),
                               min_y+y*((max_y-min_y)/img.shape[1]),
                               iters)

    gimage = np.zeros((1024, 1024), dtype = np.uint8)
    d_image = cuda.to_device(gimage)
    mandel_kernel[(32,32), (32,32)](d_image, -2.0, 1.0, -1.0, 1.0, 20)
    d_image.to_host()

Mandelbrot           Time      Speedup v. Pure Python
Pure Python          4.85s     --
NumbaPro (CPU)       0.11s     44x
CUDA Python (K20)    0.004s    1221x
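For readers without a GPU, the same computation can be written as plain Python. This is a minimal CPU sketch, not the NumbaPro code: `mandel_image` is a hypothetical helper in which one loop iteration stands in for one GPU thread's (x, y) index, and the image is a list of lists rather than a NumPy array.

```python
def mandel(x, y, max_iters):
    """Escape-time count for the point x + iy (same arithmetic as the slide)."""
    zr, zi = 0.0, 0.0
    for i in range(max_iters):
        newzr = (zr * zr - zi * zi) + x
        zi = 2 * zr * zi + y
        zr = newzr
        if (zr * zr + zi * zi) >= 4:
            return i
    return 255


def mandel_image(width, height, min_x, max_x, min_y, max_y, iters):
    """CPU stand-in for the kernel: each (x, y) pair is what one thread would do."""
    img = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            img[y][x] = mandel(min_x + x * ((max_x - min_x) / width),
                               min_y + y * ((max_y - min_y) / height),
                               iters)
    return img


if __name__ == "__main__":
    size = 64
    image = mandel_image(size, size, -2.0, 1.0, -1.0, 1.0, 20)
    escaped = sum(v != 255 for row in image for v in row)
    print(f"{escaped} of {size * size} points escaped within 20 iterations")
```

On the GPU the two nested loops disappear: each thread obtains its own (x, y) from `cuda.grid(2)` and computes a single pixel.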

18 KAYLA

19 Kayla Development Platform
CUDA 5, OpenGL 4.3
Kick-starts ARM + CUDA ecosystem
NAMD ported in 2 days
Quad ARM + Kepler GPU
Quad ARM + any CUDA GPU

20 Platform for Parallel Computing
Platform: GPUDirect, UVA, GPU-Aware MPI, Dynamic Parallelism
Compiler Tool Chain: NVCC, LLVM, Device Code Linking
Programming Languages: C, C++, Templates, Inheritance, Recursion, Function pointers, Fortran (PGI), C++ new/delete, Virtual functions, OpenACC
Libraries: cublas, cufft, NVPP, cusparse, curand, Thrust, new NVPP functions, cublas Device API
Developer Tools: Command-Line Profiler, cuda-gdb, Visual Profiler, cuda-memcheck, nvidia-smi, Nsight IDE, New Visual Profiler, Nsight Eclipse Ed., Detect Shared Memory Hazards

21 Platform for Parallel Computing
Release column: 5.0
Platform: ARM Support
Compiler Tool Chain: JIT Linking, JIT Compilation
Programming Languages: C++11
Libraries: Sparse Solvers, Multi-GPU Support
Developer Tools: Profiler Step-by-Step Guidance, Single-GPU Debugging

22 Future Challenges Today Easier Parallel Programming Optimizing locality and computation Task, Thread & Data Parallelism Hybrid operating system Enablement Parallel Compiler Foundation Enablement Ubiquitous parallel programming Power Aware Programming

23 GPUs Everywhere MPI

24 Thank you
