Compiling CUDA and Other Languages for GPUs. Vinod Grover and Yuan Lin

1 Compiling CUDA and Other Languages for GPUs Vinod Grover and Yuan Lin

2 Agenda Vision Compiler Architecture Scenarios SDK Components Roadmap Deep Dive SDK Samples Demos

3 Vision
Build a platform for GPU computing around the foundations of CUDA: bring other languages to GPUs and enable CUDA for other platforms. Make that platform available for ISVs, researchers, and hobbyists, and create a flourishing ecosystem.
(Diagram: CUDA C/C++ and new languages feed the LLVM compiler for CUDA, which targets NVIDIA GPUs, x86 CPUs, and new processors.)

4 Compiler Architecture

5 CUDA Compiler Architecture (4.1, 4.2 and 5.0)
Inside NVCC: .cu -> CUDA FE -> LLVM Optimizer -> NVPTX CodeGen -> PTX -> PTXAS, with host code handled by the host compiler; the resulting application runs on the CUDA Runtime and CUDA Driver.

6 Open Compiler Architecture
Inside NVCC: .cu -> CUDA FE (libcuda.lang) -> NVVM IR -> LLVM Optimizer -> NVPTX CodeGen (libnvvm); the NVPTX CodeGen is open sourced. The generated PTX goes through PTXAS, host code through the host compiler, and the application runs on the CUDA Runtime and CUDA Driver.

7 Scenarios

8 Building Production Quality Compilers
NVCC (via libcuda.lang), CUDA Fortran, and OpenACC compilers all build on libnvvm and the CUDA Runtime.

9 Building Domain Specific Languages
A DSL front end builds on libnvvm (alongside NVCC and libcuda.lang), with a DSL runtime layered on the CUDA Runtime.

10 Enabling Other Platforms
The NVCC/libcuda.lang front end can feed an x86 LLVM backend instead of libnvvm, with an x86 CUDA runtime standing in for the GPU CUDA Runtime.

11 Enabling Research in GPU Computing
For example, a CU++ front end built on Clang can target either an x86 LLVM backend or libnvvm, together with a custom runtime.

12 Compiler SDK Components
Open source NVPTX backend, NVVM IR specification, libnvvm, and libcuda.lang.

13 Enabling open source GPU compilers
Contribute the NVPTX code generator sources back to LLVM trunk, so the sources are available directly from LLVM under the standard LLVM license. The community is invited to use it and to contribute improvements to it.

14 NVVM IR
An intermediate representation based on LLVM and targeted at GPU computing. The IR specification describes address spaces, kernels and device functions, intrinsics, and more.

15 libnvvm
A binary component library for Windows, Linux and Mac OS targeting PTX 3.1 on Fermi, Kepler and later architectures, for 32-bit and 64-bit hosts. Includes an NVVM IR verifier, an API document, and sample applications.

16 libcuda.lang
A binary library that takes a CUDA source program and generates NVVM IR and a host C++ program for the supported platforms. Includes an API document and sample applications.

17 Roadmap

18 Availability
Open source NVPTX backend: May 2012. NVVM IR spec, libnvvm, and samples: preview release in May 2012, production release in the next release after CUDA 5.0. libcuda.lang: to be announced.

19 Deep Dive

20 Deep Dive NVVM IR libnvvm SDK Samples

21 NVVM IR
Designed to represent GPU kernel functions and device functions; it represents the code executed by each CUDA thread. See the NVVM IR Specification 1.0.
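To ground what "the code executed by each CUDA thread" means at the source level, here is a minimal CUDA kernel (invented for this write-up, not from the slides): the NVVM IR generated for it describes the body run by one thread, with the thread and block indices obtained from special registers rather than from an explicit loop over elements.

    // Illustrative only: the NVVM IR for this kernel represents the work of a
    // single thread; the grid of threads is created at launch time.
    __global__ void scale(float *out, const float *in, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n)
            out[i] = alpha * in[i];
    }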

22 NVVM IR and LLVM IR
NVVM IR is based on LLVM IR, with a set of rules and intrinsics: no new types, no new operators, no new reserved words. An NVVM IR program can work with any standard LLVM IR tool (llvm-as, llvm-link, llvm-extract, llvm-dis, llvm-ar) and can be built with the standard LLVM distribution (svn co llvm).

23 NVVM IR

24 NVVM IR: Address Spaces

25 NVVM IR: Address Spaces
CUDA C/C++: the address space is a storage qualifier; a pointer is a generic pointer, which can point to any address space.

    __device__   int g;   // global memory
    __shared__   int s;   // shared memory
    __constant__ int c;   // constant memory

    __device__ void foo(int a) {
        int l;            // local (per-thread)
        int *p;           // generic pointer
        switch (a) {
        case 1: p = &g; break;
        case 2: p = &s; break;
        case 3: p = &c; break;
        case 4: p = &l; break;
        }
    }

26 NVVM IR: Address Spaces
CUDA C/C++: the address space is a storage qualifier; a pointer is a generic pointer, which can point to any address space.
OpenCL C: the address space is part of the type system; a pointer type must be qualified with an address space.

    global   int g;
    constant int c;

    void foo(local int *ps) {
        int l;
        int *p;                      // private pointer by default
        p = &l;
        global   int *pg = &g;
        constant int *pc = &c;
    }

27 NVVM IR: Address Spaces
CUDA C/C++: the address space is a storage qualifier; a pointer is a generic pointer, which can point to any address space.
OpenCL C: the address space is part of the type system; a pointer type must be qualified with an address space.
NVVM IR: supports both models in the same program.

28 NVVM IR: Address Spaces
NVVM IR defines the address space numbers, allows both generic pointers and specific (address-space-qualified) pointers, and provides intrinsics to perform conversions between generic and specific pointers.
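To make the generic-versus-specific distinction concrete, here is a small CUDA C++ sketch (invented for this write-up, not taken from the slides): each variable lives in a different address space, and wherever a specific pointer is handed to the generic-pointer parameter, the compiler emitting NVVM IR inserts one of the conversion intrinsics described above at the call boundary.

    // Hedged illustration: specific -> generic pointer conversions in CUDA C++.
    // The function and variable names are made up for this example.
    __device__ int g_val = 7;                    // global address space

    __device__ int read_generic(const int *p)    // takes a generic pointer
    {
        return *p;
    }

    __global__ void demo_kernel(int *out)
    {
        __shared__ int s_val;                    // shared address space
        int l_val = threadIdx.x;                 // per-thread local variable

        if (threadIdx.x == 0) s_val = 42;
        __syncthreads();

        // Each call passes a pointer from a different address space; the
        // compiler converts it to a generic pointer before the call.
        out[threadIdx.x] = read_generic(&g_val)
                         + read_generic(&s_val)
                         + read_generic(&l_val);
    }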

29 NVVM IR: GPU Program Properties
Properties such as the maximum/minimum expected CTA size of any launch, the minimum number of CTAs on an SM, kernel function vs. device function, texture/surface variables, and more are expressed using named metadata.
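At the CUDA C++ level, the launch-size properties above are what __launch_bounds__ expresses; the following is a minimal hedged sketch (the kernel name and the bound values are arbitrary) of the source annotation whose information the front end records alongside the kernel as named metadata.

    // Hedged CUDA sketch: a kernel annotated with launch bounds.
    __global__ void
    __launch_bounds__(256 /* max threads per CTA */, 4 /* min CTAs per SM */)
    annotated_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = 2.0f * data[i];
    }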

30 NVVM IR: Intrinsics
Intrinsics for atomic operations, barriers, address space conversions, special register reads, texture/surface access, and more.
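The usual CUDA-level sources of these intrinsics are shown in the hedged sketch below (the kernel is invented for illustration): threadIdx/blockIdx reads, __syncthreads(), and atomicAdd() are lowered by the front end to NVVM special-register, barrier, and atomic intrinsics respectively.

    // Hedged CUDA sketch: source constructs that map onto NVVM intrinsics.
    __global__ void count_positives(const int *in, int *block_counts, int n)
    {
        __shared__ int local_count;

        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // special register reads
        if (threadIdx.x == 0) local_count = 0;
        __syncthreads();                                  // barrier

        if (tid < n && in[tid] > 0)
            atomicAdd(&local_count, 1);                   // atomic operation
        __syncthreads();

        if (threadIdx.x == 0)
            block_counts[blockIdx.x] = local_count;
    }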

31 NVVM IR: NVVM ABI for PTX
Allows interoperability of PTX code generated by different NVVM compilers by specifying how NVVM linkage types are mapped to PTX linkage types, how NVVM data types are mapped to PTX data types, and how function calls are lowered to PTX calls.

32 libnvvm Library
libnvvm: an optimizing compiler library (NVVM IR -> PTX). It provides address space access optimization, thread convergence analyses, re-materialization, load/store coalescing, sign extension elimination, phi elimination, inline asm (PTX) support, and tuning of existing optimizations.

33 libnvvm Library
libnvvm: an optimizing compiler library (NVVM IR -> PTX) that can sit underneath the CUDA C compiler, DSL compilers, tools, or your own compiler.

34 libnvvm APIs
Start and shutdown; create a compilation unit; add NVVM IR modules to a compilation unit (with support for NVVM IR level linking); verify the input IR; compile to PTX; get the result (the PTX string and the message log).

35 Samples

36 SDK Sample 1: Simple
Read in an NVVM IR program in LL or BC format from disk, generate PTX, and execute the PTX on the GPU.

    // Read in
    fread(source, size, 1, fh);

    // Compile
    nvvmInit();
    nvvmCreateCU(&cu);
    nvvmCUAddModule(cu, source, size);
    nvvmCompileCU(cu, num_options, options);
    nvvmGetCompiledResultSize(cu, &ptxSize);
    nvvmGetCompiledResult(cu, ptx);
    nvvmDestroyCU(&cu);
    nvvmFini();

    // Execute ...
    cuModuleLoadDataEx(phModule, ptx, 0, 0, 0);
    cuModuleGetFunction(phKernel, *phModule, "simple");
    cuLaunchGrid(hKernel, nBlocks, 1);
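For readers filling in the elided "Execute" step, the sketch below expands it with the legacy CUDA driver API calls that match cuLaunchGrid on the slide. It is an illustration only, not the sample's actual code: the context setup, the single device-pointer argument, and all sizes are assumptions made here, error checking is omitted, and on current toolkits cuLaunchKernel replaces the cuFuncSetBlockShape/cuParamSet*/cuLaunchGrid sequence.

    // Hedged sketch of JIT-ing and launching the generated PTX; 'ptx' is the
    // string returned by nvvmGetCompiledResult above.
    CUdevice    dev;
    CUcontext   ctx;
    CUmodule    hModule;
    CUfunction  hKernel;
    CUdeviceptr dData;
    int nBlocks = 32, blockSize = 128;           // arbitrary launch shape

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoadDataEx(&hModule, ptx, 0, 0, 0);  // JIT the PTX string
    cuModuleGetFunction(&hKernel, hModule, "simple");

    cuMemAlloc(&dData, nBlocks * blockSize * sizeof(float));

    // Legacy execution-control API: set the CTA shape and the kernel
    // parameters, then launch an nBlocks x 1 grid.
    cuFuncSetBlockShape(hKernel, blockSize, 1, 1);
    cuParamSetv(hKernel, 0, &dData, sizeof(dData));
    cuParamSetSize(hKernel, sizeof(dData));
    cuLaunchGrid(hKernel, nBlocks, 1);

    cuCtxSynchronize();
    cuMemFree(dData);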

37 SDK Sample 1: Simple
Flow: IR from file -> libnvvm (NVVM IR -> PTX) -> PTX JIT and execution on GPU.

38 SDK Sample 2: ptxgen
Read in a list of NVVM IR programs in LL or BC format from disk and perform IR level linking:

    for (i = 0; i < num_files; i++)
        nvvmCUAddModule(cu, source[i], size[i]);

Then generate PTX, or only verify the IR against the NVVM IR spec, by choosing the compile option:

    const char *gen_opts[]    = { "-target=nvptx"  };
    const char *verify_opts[] = { "-target=verify" };
    nvvmCompileCU(cu, 1, gen_opts);      // generate PTX
    nvvmCompileCU(cu, 1, verify_opts);   // verify only
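Assembling the fragments above into one place, here is a hedged sketch of a ptxgen-style driver. It is not the actual SDK sample: the read_file() helper, the verify_only flag, and the compilation-unit handle type name nvvmCU are assumptions made for this illustration, exact parameter types are as defined in the libnvvm API document, and error checking of the nvvm* return codes is omitted.

    #include <nvvm.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper: load a .ll/.bc file into memory. */
    extern char *read_file(const char *path, size_t *size);

    static void ptxgen(const char **files, int num_files, int verify_only)
    {
        const char *gen_opts[]    = { "-target=nvptx"  };
        const char *verify_opts[] = { "-target=verify" };
        nvvmCU cu;                    /* handle type name assumed; see nvvm.h */
        int i;

        nvvmInit();
        nvvmCreateCU(&cu);

        /* IR level linking: add every input module to one compilation unit. */
        for (i = 0; i < num_files; i++) {
            size_t size;
            char *source = read_file(files[i], &size);
            nvvmCUAddModule(cu, source, size);
        }

        if (verify_only) {
            nvvmCompileCU(cu, 1, verify_opts);        /* check against the spec */
        } else {
            size_t ptxSize;
            char *ptx;
            nvvmCompileCU(cu, 1, gen_opts);           /* generate PTX */
            nvvmGetCompiledResultSize(cu, &ptxSize);
            ptx = (char *)malloc(ptxSize);
            nvvmGetCompiledResult(cu, ptx);
            fwrite(ptx, 1, ptxSize, stdout);
            free(ptx);
        }

        nvvmDestroyCU(&cu);
        nvvmFini();
    }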

39 SDK Sample 2: ptxgen
Flow: IR from files -> libnvvm (IR level linking, IR verifier, NVVM IR -> PTX).

40 SDK Sample 3: Kaleidoscope
Based on the Kaleidoscope example in the LLVM tutorial: an interpreter for a simple language that builds expressions from primitive operators and user-defined functions, and can evaluate a function on a GPU over a sequence of arguments. Uses the LLVM IR builder. Statically linked with stock LLVM 3.0 (libLLVMCore.a and libLLVMBitWriter.a); dynamically linked with libnvvm.so and libcuda.so.

41 SDK Sample 3: Kaleidoscope

    ready> def foo(x)
    ready> 4*x;
    ready> eval foo ;
    ready> Evaluating foo on a gpu: Using CUDA Device [0]: GeForce GTX
    ready> def bar(a b)
    ready> foo(a)+b;
    ready> eval bar ;
    ready> Evaluating bar on a gpu: Using CUDA Device [0]: GeForce GTX

42 SDK Sample 3: Kaleidoscope
Flow: interpreter -> LLVM IR builder -> libnvvm (NVVM IR -> PTX) -> PTX JIT and execution on GPU.

43 SDK Sample 4: Glang
A prototype compiler based on Clang that compiles a subset of C++ programs to execute on GPUs. It has its own set of builtin functions.

    glang_kernel kernel_vector_add(global const float* a,
                                   global const float* b,
                                   global float* c,
                                   unsigned int size)
    {
        // Determine thread/block id/size
        int tidx  = glang_tid_x();
        int bidx  = glang_ctaid_x();
        int bsize = glang_ntid_x();

        // Compute offset into vector
        int index = bidx*bsize + tidx;

        // Compute addition for our element
        if (index < size) {
            c[index] = a[index] + b[index];
        }
    }

45 SDK Samples
Flow: clang (with user builtin functions) -> libnvvm (NVVM IR -> PTX, IR level linking) -> PTX JIT and execution on GPU.

46 SDK Sample 5: Rg
R is a language and environment for statistical computing and graphics. Rg dynamically compiles R code and executes it on a GPU, supporting a useful subset of R; it is an example of how to accelerate a DSL using libnvvm.

47 SDK Sample 5: Rg

    > v1 = c (1, 2, 3, 4)
    > dv1 = rg.dv (c (1, 2, 3, 4))
    > dv2 = rg.dv (c (10, 20, 30, 40))
    > dv3 = rg.gapply (function(x, y) {x+y;}, dv1, dv2)
    > as.double (dv3)
    [1] 11 22 33 44

48 SDK Sample 5: Rg

    # define the code for Mandelbrot
    mandelbrot <- function(x0, y0) {
        iteration <- 0L;
        max_iteration <- 50L;
        x <- 0;
        y <- 0;
        while ( (x*x + y*y < 4) && (iteration < max_iteration) ) {
            xtemp <- x*x - y*y + x0;
            y = 2*x*y + y0;
            x = xtemp;
            iteration = iteration + 1L;
        }
        color = iteration;
        color;
    }

    # create data
    dv_points_x = rg.dv(points_x);
    dv_points_y = rg.dv(points_y);

    # compile and run and get results!
    dv_points_color = rg.gapply(mandelbrot, dv_points_x, dv_points_y);
    colorvec = as.integer(dv_points_color);

49 Demo

50 SDK Sample 5: Rg
Flow: R -> LLVM IR builder -> libnvvm (NVVM IR -> PTX) -> PTX JIT and execution on GPU.

51 SDK Samples
Summary flow: interpreter, R, clang, or IR from file -> LLVM IR builder / user builtin functions -> libnvvm (NVVM IR -> PTX, IR level linking, IR verifier) -> PTX JIT and execution on GPU.

52 How to get it?
Open to all NVIDIA registered developers. Available with the CUDA 5.0 preview release. File bugs and post questions to the NVIDIA forum, tagged NVVM.


54 Backup slides

55 SDK Sample 3: Kaleidoscope ks Interpreter

56 SDK Sample 4: Glang
Build components: a.glang, a.glang.cpp, a.glang.hpp, host.cpp, a.out, libglangrt.so, libcuda.so.

57 SDK Sample 4: Glang
glang compiles a.glang into a.glang.cpp and a.glang.hpp; these are built together with host.cpp into a.out, which links against libglangrt.so and libcuda.so.

58 SDK Sample 5: Rg (Overview)
Create R objects as proxies for GPU vector data; the Rg runtime handles data creation and transfers to and from the GPU. Write scalar R functions and use rg.gapply to map the scalar function over the vector data. Rg dynamically creates code for the scalar function, specialized to the runtime type of the respective vector elements, launches the generated code on the GPU, and creates an R object which is a proxy for the result.

59 SDK Sample 5: Rg (Compiler)
rg.gapply(f, v1, v2, ..., vn) compiles f and launches the generated code on vector arguments v1, ..., vn already resident on the GPU. v1, ..., vn are atomic vectors of element types t1, ..., tn discovered at runtime. Rg creates a type-specialized LLVM IR for f with signature f: t1 x ... x tn -> T, where T is the inferred type of the body of f.
