Developing Portable CUDA C/C++ Code with Hemi

Software development is as much about writing code fast as it is about writing fast code, and central to rapid development is software reuse and portability. When building heterogeneous applications, developers must be able to share code between projects, platforms, compilers, and target architectures. Ideally, libraries of domain-specific code should be easily retargetable. In this post I'll talk about Hemi, a simple open-source C++ header library that simplifies writing portable CUDA C/C++ code.

In the screenshot below, both columns show a simple Black-Scholes code written to be compilable with either NVCC or a standard C++ host compiler, and also runnable on either the CPU or a CUDA GPU. The right column is written using Hemi's macros and its smart heterogeneous array container class, hemi::array. Using Hemi, the length and complexity of this code is reduced by half.

Portable CUDA C++ code without Hemi (left) and with Hemi (right).

CUDA C++ and the NVIDIA NVCC compiler toolchain provide a number of features designed to make it easier to write portable code, including language-level integration of host and device code and data, declaration specifiers (e.g. __host__ and __device__), and preprocessor definitions (e.g. __CUDACC__). Together, these features enable developers to write code that can be compiled and run on either the host, the device, or both. But as the left column above shows, using them directly can result in complicated code. One cause is the code duplication required to support multiple target platforms; another is the verbose memory management incurred by heterogeneous memory spaces. Hemi aims to tackle both problems.

Hemi is inspired by real-world CUDA software projects like PhysX and OptiX, which use custom libraries of preprocessor macros and container classes that enable the definition of portable application-specific libraries, classes, and kernels. PhysX, for example, has a comprehensive 3D vector math library that is portable across multiple platforms, including CUDA GPUs, Intel and other CPUs, and game consoles. To make CUDA memory management and transfers robust and simple to implement, PhysX uses a smart generic array class that automatically copies data between the device and host only when necessary. The result is much like the right-hand side of the screenshot above, with a minimum of memory management code and no explicit memory copies. In this post I'll describe Hemi in depth, but first I want to cover the CUDA C/C++ language and compiler features on which Hemi is built.

CUDA C++ LANGUAGE INTEGRATION AND PORTABILITY FEATURES

Host / Device Functions

If you are already programming in CUDA C/C++ then you are familiar with __device__, the declaration specifier that indicates a function callable from other device functions and from kernel (__global__) functions. CUDA also provides the __host__ declaration specifier for host (CPU) functions, which is the default in the absence of a specifier. Often we need to execute exactly the same code on the CPU and GPU, and in those cases we need to write functions that are callable from either the host or the device. In that case, __host__ and __device__ can be combined, as shown in the following inline function that averages two floats.

    __host__ __device__ inline float avgf(float x, float y) { return (x+y)/2.0f; }

When NVCC sees this function, it generates two versions of the code, one for the host and one for the device. Any calls to the function from device code will execute the device version, and any calls from host code will execute the host version. This __host__ __device__ combination is very powerful because it enables large utility code bases to be used across heterogeneous applications, minimizing the work required to port them. However, other compilers (obviously) don't recognize these declaration specifiers, so to write truly portable code we need to use the C preprocessor.

CUDA Preprocessor Definitions

At compile time, NVCC defines several macros that can be used to selectively enable and disable code based on whether it is being compiled by NVCC, whether it is device code or host code, and based on the architecture version (also called compute capability) for which it is being compiled.

__NVCC__: can be used in C/C++/CUDA source files to test whether they are currently being compiled by nvcc.

__CUDACC__: can be used in source files to test whether they are being treated as CUDA source files by nvcc.

__CUDA_ARCH__: this architecture identification macro is assigned a three-digit value string xy0 (ending in a literal 0) when compiling device code for compute_xy. For example, when compiling device code for compute_20 (or sm_20), __CUDA_ARCH__ is defined by nvcc to the value 200. This macro can be used in the implementation of device and kernel functions to determine the virtual architecture for which they are currently being compiled. Host code must not depend on this macro, but note that it is not defined when host code is being compiled, which means it can be used to detect compilation of device code.

The following example combines declaration specifiers and preprocessor macros to write a portable routine for counting the number of bits that are set in a 32-bit word.

    #ifdef __CUDACC__
    __host__ __device__
    #endif
    unsigned int countSetBits(unsigned int a)
    {
    #if defined(__CUDA_ARCH__)
        return __popc(a);
    #else
        // Classic SWAR population count (Bit Twiddling Hacks)
        a = a - ((a >> 1) & 0x55555555);
        a = (a & 0x33333333) + ((a >> 2) & 0x33333333);
        return ((a + (a >> 4) & 0x0F0F0F0F) * 0x01010101) >> 24;
    #endif
    }

Here I have defined a function countSetBits that is callable from either host or device code, and because the check for __CUDACC__ wraps __host__ __device__, it is compilable with NVCC or with other C/C++ compilers. On the CPU it uses integer arithmetic to count the 1 bits. On the device, it uses CUDA's built-in __popc() intrinsic. If you look in CUDA's device_functions.h header file, you'll see that the value of __CUDA_ARCH__ is used to differentiate further: on Fermi and later GPUs (sm_20, __CUDA_ARCH__ == 200), __popc() generates a single hardware population count instruction, while on earlier architectures it uses code similar to the host code.

HEMI: EASIER PORTABLE CODE

As you can see, CUDA makes writing portable code feasible and flexible, but doing so is not particularly simple. Hemi, available on GitHub, provides just two simple header files (and a few examples) that make the task much easier, with much clearer code. The hemi.h header provides simple macros that are useful for reusing code between CUDA C/C++ and C/C++ written for other platforms (e.g. CPUs). The macros are used to decorate function prototypes and variable declarations so that they can be compiled by either NVCC or a host compiler (for example gcc, or cl.exe, the MS Visual Studio compiler). The macros can be used within .cu, .cuh, .cpp, .h, and .inl files to define code that can be compiled either for the host or for the device. Before diving into the features of Hemi, let me draw your attention to the Hemi examples.

blackscholes: a simple example that performs a Black-Scholes options pricing calculation using code that is entirely shared between host code compiled with any C/C++ compiler (including NVCC) and device code compiled with NVCC. When compiled with nvcc -x cu (to force CUDA compilation of the .cpp file), it runs on the GPU. When compiled with nvcc or g++ it runs on the host.

blackscholes_nohemi: just like the above, except it doesn't use Hemi. This is included to demonstrate the complexity that Hemi eliminates.

blackscholes_hostdevice: demonstrates how to write portable code that can be compiled to run the same code on both the host and the device in a single compile and run. This increase in run-time flexibility has a slight complexity cost, but all of the core computational code is reused.

blackscholes_hemiarray: the same as the blackscholes example, except that it uses hemi::array to encapsulate CUDA-specific memory management code and eliminate all explicit host-device memory copy code.

nbody_vec4: brings all of Hemi's features together. It implements a simple all-pairs n-body gravitational force calculation using a 4D vector class called Vec4f, which uses Hemi macros so that all of the code for the class can be shared between host code compiled by the host compiler and device or host code compiled with NVCC. nbody_vec4 also shares most of the all-pairs gravitational force calculation code between device and host, and demonstrates how optimized device implementations (e.g. using shared memory) can be substituted as needed. Finally, this sample also uses hemi::array to simplify memory management and data transfers.

HEMI PORTABLE FUNCTIONS

A typical use for host-device code sharing is commonly used utility functions. For example, here is a portable version of our earlier example function that averages two floats.

    HEMI_DEV_CALLABLE_INLINE float avgf(float x, float y) { return (x+y)/2.0f; }

This function can be called from either host code or device code, and can be compiled by either the host compiler or NVCC. The macro definition ensures that when compiled by NVCC, both a host and a device version of the function are generated, and a normal inline function is generated when compiled by the host compiler. For another example, see the CND() function defined in the blackscholes example included with Hemi, as well as several other functions used in the examples.
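Hemi's actual macro definitions live in hemi.h; to make the mechanism concrete, here is a minimal sketch (not necessarily Hemi's exact code) of how such a macro can be built from the CUDA features described earlier.

    // Conceptual sketch only -- see hemi.h for the real definitions.
    #ifdef __CUDACC__                   // compiled by NVCC: emit host and device versions
      #define HEMI_DEV_CALLABLE_INLINE __host__ __device__ inline
    #else                               // plain host compiler: an ordinary inline function
      #define HEMI_DEV_CALLABLE_INLINE inline
    #endif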

HEMI PORTABLE CLASSES

The HEMI_DEV_CALLABLE_MEMBER and HEMI_DEV_CALLABLE_INLINE_MEMBER macros can be used to create classes that are reusable between host and device code, by decorating any member function prototype that will be used by both device and host code. Here is an example excerpt of a portable class (a 4D vector type used in the nbody_vec4 example).

    struct HEMI_ALIGN(16) Vec4f
    {
      float x, y, z, w;

      HEMI_DEV_CALLABLE_INLINE_MEMBER
      Vec4f() {}

      HEMI_DEV_CALLABLE_INLINE_MEMBER
      Vec4f(float xx, float yy, float zz, float ww) : x(xx), y(yy), z(zz), w(ww) {}

      HEMI_DEV_CALLABLE_INLINE_MEMBER
      Vec4f(const Vec4f& v) : x(v.x), y(v.y), z(v.z), w(v.w) {}

      HEMI_DEV_CALLABLE_INLINE_MEMBER
      Vec4f& operator=(const Vec4f& v) {
        x = v.x; y = v.y; z = v.z; w = v.w;
        return *this;
      }

      HEMI_DEV_CALLABLE_INLINE_MEMBER
      Vec4f operator+(const Vec4f& v) const {
        return Vec4f(x+v.x, y+v.y, z+v.z, w+v.w);
      }
      ...
    };

The HEMI_ALIGN macro is used on types that will be passed in arrays or pointers as arguments to CUDA device kernel functions, to ensure proper alignment. HEMI_ALIGN generates correct alignment specifiers for host compilers, too. For details on alignment, see the NVIDIA CUDA C Programming Guide (Section 5.3 in v5.0).
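To see why a macro is needed here, note that each compiler spells alignment differently. A rough sketch of a portable alignment macro follows; the name PORTABLE_ALIGN is illustrative, and this is not Hemi's exact definition.

    // Sketch of a portable alignment macro (illustrative name, not Hemi's exact code).
    #if defined(__CUDACC__)
      #define PORTABLE_ALIGN(n) __align__(n)                   // NVCC
    #elif defined(_MSC_VER)
      #define PORTABLE_ALIGN(n) __declspec(align(n))           // MS Visual C++
    #else
      #define PORTABLE_ALIGN(n) __attribute__((aligned(n)))    // GCC / Clang
    #endif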

NOTE: DEVICE-SPECIFIC CODE

Code in functions declared with HEMI_DEV_CALLABLE_* must be portable. In other words, it must compile and run correctly for both the host and the device. If it does not, you can use HEMI_DEV_CODE (which reduces to __CUDA_ARCH__) within the function to define separate code paths for host and device, as in the following example.

    HEMI_DEV_CALLABLE_INLINE_MEMBER
    float inverseLength(float softening = 0.0f) const {
    #ifdef HEMI_DEV_CODE
      return rsqrtf(lengthSqr() + softening); // use fast GPU intrinsic
    #else
      return 1.0f / sqrtf(lengthSqr() + softening);
    #endif
    }

If you need to write a function only for the device, use the CUDA C __device__ specifier directly.

Note: Non-inline functions and methods

Take care when using the non-inline versions of the declaration specifier macros (HEMI_DEV_CALLABLE and HEMI_DEV_CALLABLE_MEMBER) to avoid multiple-definition linker errors caused by using them in headers that are included into multiple compilation units. The best way to use HEMI_DEV_CALLABLE is to declare functions with this macro in a header, define their implementation in a .cu file, and compile it with NVCC. This generates code for both host and device. The host code will be linked into your library or application and is callable from other host-code compilation units (.c and .cpp files). Likewise, for HEMI_DEV_CALLABLE_MEMBER, put the class and function declarations in a header, and the member function implementations in a .cu file compiled by NVCC.
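For example, a minimal sketch of this layout might look like the following (the file names vecmath.h/vecmath.cu and the dot3 function are illustrative, not part of the Hemi examples).

    // vecmath.h -- declaration included by both .cpp and .cu compilation units
    HEMI_DEV_CALLABLE float dot3(const float* a, const float* b);

    // vecmath.cu -- definition compiled by NVCC, generating host and device versions
    HEMI_DEV_CALLABLE float dot3(const float* a, const float* b)
    {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
    }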

HEMI PORTABLE KERNELS

Use HEMI_KERNEL to declare functions that are launchable as CUDA kernels when compiled with NVCC, or callable as C/C++ (host) functions when compiled with the host compiler. HEMI_KERNEL_LAUNCH is a convenience macro that launches a kernel function on the device when compiled with NVCC, or calls the host function when compiled with the host compiler. For example, here is an excerpt from the blackscholes example, which is a single .cpp file that can be either compiled with NVCC to run on the GPU, or compiled with the host compiler to run on the CPU.

    // Black-Scholes formula for both call and put
    HEMI_KERNEL(BlackScholes)
        (float *callResult, float *putResult, const float *stockPrice,
         const float *optionStrike, const float *optionYears, float Riskfree,
         float Volatility, int optN)
    {
        ...
    }

    // ... in main() ...
    HEMI_KERNEL_LAUNCH(BlackScholes, gridDim, blockDim, 0, 0,
                       d_callResult, d_putResult, d_stockPrice, d_optionStrike,
                       d_optionYears, RISKFREE, VOLATILITY, OPT_N);

HEMI_KERNEL_LAUNCH requires grid and block dimensions to be passed to it, but these parameters are ignored when compiled for the host. When DEBUG is defined, HEMI_KERNEL_LAUNCH checks for CUDA launch and run-time errors. You can use HEMI_KERNEL_NAME to access the generated name of the kernel function, for example to pass a function pointer to CUDA API functions like cudaFuncGetAttributes().

Iteration

For kernel functions with simple independent element-wise parallelism, Hemi provides two functions that enable iterating over elements sequentially in host code or in parallel in device code.

hemiGetElementOffset() returns the offset of the current thread within the 1D grid, or zero for host code. In device code, it resolves to blockDim.x * blockIdx.x + threadIdx.x.

hemiGetElementStride() returns the size of the 1D grid in threads, or one in host code. In device code, it resolves to gridDim.x * blockDim.x.

The blackscholes example demonstrates iteration in the following function, which can be compiled and run as a sequential function on the host or as a CUDA kernel on the device.

    // Black-Scholes formula for both call and put
    HEMI_KERNEL(BlackScholes)
        (float *callResult, float *putResult, const float *stockPrice,
         const float *optionStrike, const float *optionYears, float Riskfree,
         float Volatility, int optN)
    {
        int offset = hemiGetElementOffset();
        int stride = hemiGetElementStride();

        for (int opt = offset; opt < optN; opt += stride)
        {
            // ... compute call and put values based on the Black-Scholes formula
        }
    }
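Based on the resolutions described above, a rough sketch of what these two helpers do (not Hemi's exact code) looks like this:

    // Sketch of the iteration helpers, following the behavior described above.
    #ifdef __CUDA_ARCH__   // device-code compilation pass
      __device__ inline int hemiGetElementOffset()
      { return blockDim.x * blockIdx.x + threadIdx.x; }
      __device__ inline int hemiGetElementStride()
      { return gridDim.x * blockDim.x; }
    #else                  // host code: iterate sequentially over all elements
      inline int hemiGetElementOffset() { return 0; }
      inline int hemiGetElementStride() { return 1; }
    #endif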

Note: the hemiGetElement*() functions are specialized to simple (but common) element-wise parallelism. As such, they may not be useful for arbitrary strides, data sharing, or other more complex parallelism arrangements, but they may serve as examples for creating your own.

HEMI PORTABLE CONSTANTS

Global constant values can be defined using the HEMI_DEFINE_CONSTANT macro, which takes a name and an initial value. When compiled with NVCC as CUDA code, this declares two versions of the constant: one __constant__ variable for the device, and one normal host variable. When compiled with a host compiler, only the host variable is defined. For static or external linkage, use the HEMI_DEFINE_STATIC_CONSTANT and HEMI_DEFINE_EXTERN_CONSTANT versions of the macro, respectively.

To access variables defined using the HEMI_DEFINE_*_CONSTANT macros, use the HEMI_CONSTANT macro, which automatically resolves to either the device or the host constant depending on whether it is accessed from device or host code. This means that the proper variable will be chosen when the constant is accessed within functions declared with the HEMI_DEV_CALLABLE_* and HEMI_KERNEL macros.

To explicitly access the device version of a constant, use HEMI_DEV_CONSTANT. This is useful when the constant is an argument to a CUDA API function such as cudaMemcpyToSymbol, as shown in the following code from the nbody_vec4 example.

    cudaMemcpyToSymbol(HEMI_DEV_CONSTANT(softeningSquared), &ss,
                       sizeof(float), 0, cudaMemcpyHostToDevice);
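For intuition about how such constant macros can work, here is a conceptual sketch with illustrative names; DEFINE_CONSTANT and CONSTANT are hypothetical and this is not Hemi's actual implementation.

    // Sketch only: define both copies of a constant and select one per compilation pass.
    #ifdef __CUDACC__
      #define DEFINE_CONSTANT(decl, value)        \
          __constant__ decl ## _devconst = value; \
          decl ## _hostconst = value
      #ifdef __CUDA_ARCH__
        #define CONSTANT(name) name ## _devconst   // device-code pass
      #else
        #define CONSTANT(name) name ## _hostconst  // host-code pass
      #endif
    #else
      #define DEFINE_CONSTANT(decl, value) decl ## _hostconst = value
      #define CONSTANT(name) name ## _hostconst
    #endif

    // Hypothetical usage: DEFINE_CONSTANT(float softeningSquared, 0.01f);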

HEMI PORTABLE DATA: HEMI::ARRAY

One of the biggest challenges in writing portable CUDA code is memory management. Hemi provides the hemi::array C++ template class (defined in hemi/array.h), a simple data management container that allows arrays of arbitrary type to be created and used with both host and device code. hemi::array maintains a host and a device pointer for each array. It lazily transfers data between the host and device as needed when the user requests a pointer to the host or device memory. Pointer requests specify read-only, read/write, or write-only options to keep the valid location of the data up to date and to copy data only when the requested pointer is invalid. hemi::array supports pinned host memory for efficient PCI-Express transfers, and handles CUDA error checking internally. Here is an excerpt from the nbody_vec4 example.

    hemi::array<Vec4f> bodies(N, true);
    hemi::array<Vec4f> forceVectors(N, true);

    randomizeBodies(bodies.writeOnlyHostPtr(), N);

    // Call host function defined in a .cpp compilation unit
    allPairsForcesHost(forceVectors.writeOnlyHostPtr(), bodies.readOnlyHostPtr(), N);

    printf("CPU: Force vector 0: (%0.3f, %0.3f, %0.3f)\n",
           forceVectors.readOnlyHostPtr()[0].x,
           forceVectors.readOnlyHostPtr()[0].y,
           forceVectors.readOnlyHostPtr()[0].z);

    ...

    // Call device function defined in a .cu compilation unit
    // that uses host/device shared functions and class member functions
    allPairsForcesCuda(forceVectors.writeOnlyDevicePtr(), bodies.readOnlyDevicePtr(), N, false);

    printf("GPU: Force vector 0: (%0.3f, %0.3f, %0.3f)\n",
           forceVectors.readOnlyHostPtr()[0].x,
           forceVectors.readOnlyHostPtr()[0].y,
           forceVectors.readOnlyHostPtr()[0].z);

Typical CUDA code requires explicit duplication of host allocations on the device and explicit copy calls between them, along with error checking for all allocations and transfers. The blackscholes_hemiarray example demonstrates how much hemi::array simplifies CUDA C code, doing in 136 lines of code what the blackscholes example does in 180 lines.

HEMI CUDA ERROR CHECKING

hemi.h provides two convenience functions for checking CUDA errors. checkCuda verifies that its single argument has the value cudaSuccess, and otherwise prints an error message and asserts if DEBUG is defined. This function is typically wrapped around CUDA API calls, as in the following.

    checkCuda( cudaMemcpy(d_price, price, OPT_SIZE, cudaMemcpyHostToDevice) );
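A minimal sketch of this kind of wrapper follows; the name checkCudaSketch is illustrative, and this is the general pattern rather than necessarily Hemi's exact implementation.

    // Sketch of an error-checking wrapper around CUDA runtime calls.
    #include <cstdio>
    #include <cassert>
    #include <cuda_runtime.h>

    inline cudaError_t checkCudaSketch(cudaError_t result)
    {
    #ifdef DEBUG
      if (result != cudaSuccess) {
        fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
        assert(result == cudaSuccess);
      }
    #endif
      return result;
    }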

checkCudaErrors takes no arguments and checks the current state of the CUDA context for errors. This function synchronizes the CUDA device (cudaDeviceSynchronize()) to ensure that asynchronous launch errors are caught. Both checkCuda and checkCudaErrors act as no-ops when DEBUG is not defined (release builds).

SUMMARY: MIX AND MATCH

I designed Hemi to provide a loosely coupled set of utilities and examples for creating reusable, portable CUDA C/C++ code. Feel free to use the parts that you need and ignore others, or modify and replace portions to suit the needs of your projects. Or just use it as an example and develop your own utilities for writing flexible and portable CUDA code. If you make changes that you feel would be generally useful, please fork the project on GitHub, commit your changes, and submit a pull request! If you would like to give feedback about Hemi, please contact me using the contact form or by filing an issue on GitHub.

Share functionality using Portable Class Libraries

Applies to: Windows Phone 8, Windows 8

This topic explains what a Portable Class Library is and how you can use it to share code between your apps for Windows Phone 8 and Windows 8. It contains the following sections: What is a Portable Class Library?; How to use a Portable Class Library in your app for Windows Phone 8 and Windows 8; What to share in a Portable Class Library; Portable Class Libraries and MVVM; Related Topics.

What is a Portable Class Library?

Portable Class Libraries have been available since .NET Framework 4. You can use them to create portable assemblies that can target multiple platforms, including Windows 7, Windows 8, Windows Phone, Silverlight, and Xbox 360. Portable Class Libraries support a subset of .NET assemblies that target the platforms you choose. Visual Studio 2012 Pro and greater versions come with a project template that you can use to create Portable Class Libraries. This is a very good way to reduce time and cost by sharing functionality across the platforms you want to support.

How to use a Portable Class Library in your app for Windows Phone 8 and Windows 8

You can use Portable Class Libraries to share functionality between your apps for Windows Phone 8 and Windows 8. Note that the Express versions of Visual Studio 2012 don't include a Portable Class Library project template; the template is available only in Visual Studio 2012 Pro or greater versions. The following diagram demonstrates how both apps share a Portable Class Library.

To reference a Portable Class Library, in Solution Explorer, select your project, and then choose Add Reference. Point to either a binary of the Portable Class Library or to the Portable Class Library project.

What to share in a Portable Class Library

When you create your app for Windows Phone 8 and Windows 8, you should identify portable code, place this code in a Portable Class Library, and share the portable library between both apps. Portable code has the following characteristics:

Managed (C# or VB) code. Portable Class Libraries are a .NET concept and support managed code only. Because Windows Phone 8 and Windows 8 share the same .NET engine, a lot of the managed code you write, particularly app logic, has the potential to be portable.

Doesn't use conditional compilation. A Portable Class Library is compiled against a set of portable .NET assemblies for the platforms you want to target. If you're building an app for Windows Phone 8 and Windows 8, this means a set of .NET assemblies that are portable on those platforms. A conditional compilation directive is intended to enable different code paths when the code is compiled for different platforms or configurations. This isn't the purpose of Portable Class Libraries. If you need to implement functionality for Windows Phone 8 and implement it differently for Windows 8, you can't include both code paths in a Portable Class Library. Instead, you should abstract away the platform-dependent code and share only the portable, platform-independent code.

Doesn't use Windows Runtime APIs. Windows Runtime APIs aren't portable and can't be used in a Portable Class Library. There is overlap in the Windows Runtime APIs that are supported on Windows Phone 8 and Windows 8; however, binary compatibility is not supported. Your code has to be compiled for each platform and therefore isn't suitable for a Portable Class Library. Here too you should abstract the use of Windows Runtime APIs into classes or objects that aren't shared in a Portable Class Library.

Doesn't use UI constructs. Although XAML for Windows Phone 8 and Windows 8 looks the same, and for the most part UI controls have the same names, this code isn't portable. Your UI code must be compiled for each platform and therefore isn't a candidate for placement in a Portable Class Library.

Portable Class Libraries and MVVM

When you create your app for Windows Phone 8 and Windows 8 using the Model-View-ViewModel (MVVM) pattern and using .NET APIs, you have the potential to share a lot of code in a Portable Class Library. Your ViewModel and Model can be designed to be portable, and you should place these in a Portable Class Library. The views of your app, and the startup code, typically are platform-specific and should be implemented in your Windows Phone 8 and Windows 8 app projects. This is illustrated in the following diagram.

If your ViewModel needs to call platform-specific code, you should abstract that functionality into a platform-independent interface and use the interface in the Portable Class Library. The interface can then be implemented in a platform-specific way in each app project. This is a very powerful code-sharing technique and allows binary sharing, because the Portable Class Library is compiled once and then used on multiple platforms.
