Developing Portable CUDA C/C++ Code with Hemi
|
|
- Eustace York
- 6 years ago
- Views:
Transcription
1 Developing Portable CUDA C/C++ Code with Hemi Software development is as much about writing code fast as it is about writing fast code, and central to rapid development is software reuse and portability. When building heterogeneous applications, developers must be able to share code between projects, platforms, compilers, and target architectures. Ideally, libraries of domain-specific code should be easily retarget able. In this post I ll talk about Hemi, a simple open-source C++ header library that simplifies writing portable CUDA C/C++ code. In the screenshot below, both columns show a simple Black-Scholes code written to be compliable with either NVCC or a standard C++ host compiler, and also run able on either the CPU or a CUDA GPU. The right column is written using Hemi s macros and smart heterogeneous Array container class, hemi::array. Using Hemi, the length and complexity of this code is reduced by half. Portable CUDA C++ code without Hemi (left) and with Hemi (right). (Click for full resolution) CUDA C++ and the NVIDIA NVCC compiler tool chain provide a number of features designed to make it easier to write portable code, including language-level integration of host and device code and data, declaration specifies (e.g. host and device ) and preprocessor definitions (e.g. CUDACC ). Together, these features enable developers to write code that can be compiled and run on either the host, the device, or both. But as the left column above shows, using them directly can result in complicated code. One cause of this is the code duplication that is required to support multiple target platforms, and another cause is the verbose memory management incurred by heterogeneous memory spaces. Hemi aims to tackle both problems. Hemi is inspired by real-world CUDA software projects like PhysX and OptiX, which use custom libraries of preprocessor macros and container classes that enable the definition of portable application-specific libraries, classes, and kernels. PhysX, for example, has a comprehensive 3D vector math library that is portable across multiple platforms, including CUDA GPUs, Intel and other CPUs, and game consoles. To make CUDA memory management and transfers robust and simple to implement, PhysX uses a smart generic array class that automatically copies data between the device and host only when necessary. The result is much like the right-hand side of the screenshot above, with a minimum of memory management code and no explicit memory copies. In this post I ll describe Hemi in depth, but first I want to cover the CUDA C/C++ language and compiler features on which Hemi is built.
2 CUDA C++ LANGUAGE INTEGRATION AND PORTABILITY FEATURES Host / Device Functions If you are already programming in CUDA C/C++ then you are familiar with device, the declaration specifier that indicates a function that is callable from other device functions and kernel ( global ) functions. CUDA also provides the host declaration specifier for host (CPU) functions, which is the default in the absence of a specifier. Often we need to execute exactly the same code on the CPU and GPU, and in those cases we need to write functions that are callable from either the host or the device. In that case, host and device can be combined, as shown in the following inline function that averages two floats. host device inline float avgf(float x, float y) { return (x+y)/2.0f; When NVCC sees this function, it generates two versions of the code, one for the host and one for the device. Any calls to the function from device code will execute the device version, and any calls from host code will execute the host version. This host device combination is very powerful because it enables large utility code bases to be used across heterogeneous applications, minimizing the work required to port applications. However, other compilers (obviously) don t recognize these declaration specifiers, so to really write portable code, we need to use the C preprocessor. CUDA Preprocessor Definitions At compile time NVCC defines several macros that can be used to selectively enable and disable code based on whether it is being compiled by NVCC, whether it is device code or host code, and based on the architecture version (also called compute capability) it is being compiled for. NVCC Can be used in C/C++/CUDA source files to test whether they are currently being compiled by nvcc. CUDACC Can be used in source files to test whether they are being treated as CUDA source files by nvcc. CUDA_ARCH This architecture identification macro is assigned a three-digit value string xy0 (ending in a literal 0) when compiling device code compute_xy. For example, when compiling device code for compute_20 (or sm_20), CUDA_ARCH will be defined bynvcc to the value 200. This macro can be used in the implementation of device and kernel functions to determine the virtual architecture for which it is currently being compiled. Host code must not depend on this macro, but note that it is not defined
3 when host code is being compiled, which means that it can be used to detect compilation of device code. The following example combines declaration specifiers and preprocessor macros to write a portable routine for counting the number of bits that are set in a 32-bit word. #ifdef CUDACC host device #endif int countleadingzeros(unsigned int a) { #if defined( CUDA_ARCH ) return popc(a); #else // Source: a = a - ((a >> 1) & 0x ); a = (a & 0x ) + ((a >> 2) & 0x ); return ((a + (a >> 4) & 0xF0F0F0F) * 0x ) >> 24; #endif Here I have defined a function countsetbits that is callable from either host or device code and due to the check for CUDACC wrapping host device, it is compilable using NVCC or other C/C++ compilers. Whether or not it is compiled with NVCC, it uses arithmetic on the CPU to count the 1 bits. On the device, it uses CUDA s built-in popc() intrinsic. If you look in CUDA s device_functions.h header file, you ll see that the value of CUDA_ARCH is used to further differentiate; on Fermi and later GPUs (sm20, CUDA_ARCH == 200) popc() generates a single hardware population count instruction, while on earlier architectures it uses code similar to the host code. HEMI: EASIER PORTABLE CODE As you can see, CUDA makes writing portable code feasible and flexible, but doing so is not particularly simple. Hemi,available on Github, provides just two simple header files (and a few examples) that make the task much easier, with much clearer code. The hemi.h header provides simple macros that are useful for reusing code between CUDA C/C++ and C/C++ written for other platforms (e.g. CPUs). The macros are used to decorate function prototypes and variable declarations so that they can be compiled by either NVCC or a host compiler (for example gcc or cl.exe, the MS Visual Studio compiler). The macros can be used within.cu,.cuh,.cpp,.h, and.inl files to define code that can be compiled either for the host or the device. Before diving into the features of Hemi, let me draw your attention to the Hemi examples.
4 blackscholes: This is a simple example that performs a Black-Scholes options pricing calculation using code that is entirely shared between host code compiled with any C/C++ compiler (including NVCC) and device code that is compiled with NVCC. When compiled with nvcc -x cu (to force CUDA compilation on the.cpp file), this runs on the GPU. When compiled with nvcc or g++ it runs on the host. blackscholes_nohemi: Just like the above, except it doesn t use Hemi. This is just to demonstrate the complexity that Hemi eliminates. blackscholes_hostdevice: This example demonstrates how to write portable code that can be compiled to run the same code on both the host and device, in a single compile & run. This increase in run-time flexibility has a slight complexity cost, but all of the core computational code is reused. blackscholes_hemiarray: This example is the same as the blackscholes example, except that it uses hemi::array to encapsulate CUDA-specific memory management code, and eliminate all explicit host-device memory copy code. nbody_vec4: This example brings all of Hemi s features together. It implements a simple all-pairs n-body gravitational force calculation using a 4D vector class called Vec4f, which uses Hemi macros to enable all of the code for the class to be shared between host code compiled by the host compiler and device or host code compiled with NVCC. nbody_vec4 also shares most of the all-pairs gravitational force calculation code between device and host, and demonstrates how optimized device implementations (e.g. using shared memory) can be substituted as needed. Finally, this sample also uses hemi::array to simplify memory management and data transfers. HEMI PORTABLE FUNCTIONS A typical use for host-device code sharing is commonly used utility functions. For example, here is a portable version of our earlier example function that averages two floats. HEMI_DEV_CALLABLE_INLINE float avgf(float x, float y) { return (x+y)/2.0f; This function can be called either from host code or device code, and can be compiled by either the host compiler or NVCC. The macro definition ensures that when compiled by NVCC, both a host and device version of the function are generated, and a normal inline function is generated when compiled by the host compiler. For another example use, see the CND() function defined in the blackscholes example included with Hemi, as well as several other functions used in the examples.
5 HEMI PORTABLE CLASSES The HEMI_DEV_CALLABLE_MEMBER and HEMI_DEV_CALLABLE_INLINE_MEMB ER macros can be used to create classes that are reusable between host and device code, by decorating any member function prototype that will be used by both device and host code. Here is an example excerpt of a portable class (a 4D vector type used in the nbody_vec4 example). struct HEMI_ALIGN(16) Vec4f { float x, y, z, w; HEMI_DEV_CALLABLE_INLINE_MEMBER Vec4f() {; HEMI_DEV_CALLABLE_INLINE_MEMBER Vec4f(float xx, float yy, float zz, float ww) : x(xx), y(yy), z(zz), w(ww) { HEMI_DEV_CALLABLE_INLINE_MEMBER Vec4f(const Vec4f& v) : x(v.x), y(v.y), z(v.z), w(v.w) { HEMI_DEV_CALLABLE_INLINE_MEMBER Vec4f& operator=(const Vec4f& v) { x = v.x; y = v.y; z = v.z; w = v.w; return *this; HEMI_DEV_CALLABLE_INLINE_MEMBER Vec4f operator+(const Vec4f& v) const { return Vec4f(x+v.x, y+v.y, z+v.z, w+v.w);... ; The HEMI_ALIGN macro is used on types that will be passed in arrays or pointers as arguments to CUDA device kernel functions, to ensure proper alignment. HEMI_ALIGN generates correct alignment specifiers for host compilers, too. For details on alignment, see the NVIDIA CUDA C Programming Guide (Section 5.3 in v5.0). NOTE: DEVICE-SPECIFIC CODE
6 Code in functions declared with HEMI_DEV_CALLABLE_* must be portable. In other words it must compile and run correctly for both the host and the device. If it does not, within the function you can use HEMI_DEV_CODE (which reduces to CUDA_ARCH ) to define separate code for host and device, as in the following example. HEMI_DEV_CALLABLE_INLINE_MEMBER float inverselength(float softening = 0.0f) const { #ifdef HEMI_DEV_CODE return rsqrtf(lengthsqr() + softening); // use fast GPU intrinsic #else return 1.0f / sqrtf(lengthsqr() + softening); #endif If you need to write a function only for the device, use the CUDA C device specifier directly. Note: Non-inline functions and methods Take care when using the non-inline versions of the declaration specifier macros (HEMI_DEV_CALLABLE andhemi_dev_callable_member) to avoid multiple definition linker errors due to using these in headers that are included into multiple compilation units. The best way to use HEMI_DEV_CALLABLE is to declare functions using this macro in a header, and define their implementation in a.cu file, and compile it with NVCC. This will generate code for both host and device. The host code will be linked into your library or application and callable from other host code compilation units (.c and.cpp files). Likewise, for HEMI_DEV_CALLABLE_MEMBER, put the class and function declaration in a header, and the member function implementations in a.cu file, compiled by NVCC. HEMI PORTABLE KERNELS Use HEMI_KERNEL to declare functions that are launchable as CUDA kernels when compiled with NVCC, or callable as C/C++ (host) functions when compiled with the host compiler. HEMI_KERNEL_LAUNCH is a convenience macro that launches a kernel function on the device when compiled with NVCC, or calls the host function when compiled with the host compiler. For example, here is an excerpt from the blackscholes example, which is a single.cpp file that can be either compiled with NVCC to run on the GPU, or compiled with the host compiler to run on the CPU. // Black-Scholes formula for both call and put
7 HEMI_KERNEL(BlackScholes) (float *callresult, float *putresult, const float *stockprice, const float *optionstrike, const float *optionyears, float Riskfree, float Volatility, int optn) {... //... in main()... HEMI_KERNEL_LAUNCH(BlackScholes, griddim, blockdim, 0, 0, d_callresult, d_putresult, d_stockprice, d_optionstrike, d_optionyears, RISKFREE, VOLATILITY, OPT_N); HEMI_KERNEL_LAUNCH requires grid and block dimensions to be passed to it, but these parameters are ignored when compiled for the host. When DEBUG is defined, HEMI_KERNEL_LAUNCH checks for CUDA launch and run-time errors. You can use HEMI_KERNEL_NAME to access the generated name of the kernel function, for example to pass a function pointer to CUDA API functions like cudafuncgetattributes(). Iteration For kernel functions with simple independent element-wise parallelism, Hemi provides two functions to enable iterating over elements sequentially in host code or in parallel in device code. hemigetelementoffset() returns the offset of the current thread within the 1D grid, or zero for host code. In device code, it resolves to blockdim.x * blockidx.x + threadidx.x. hemigetelementstride() returns the size of the 1D grid in threads, or one in host code. In device code, it resolves to griddim.x * blockdim.x. The blackscholes example demonstrates iteration in the following function, which can be compiled and run as a sequential function on the host or as a CUDA kernel on the device. // Black-Scholes formula for both call and put HEMI_KERNEL(BlackScholes) (float *callresult, float *putresult, const float *stockprice, const float *optionstrike, const float *optionyears, float Riskfree, float Volatility, int optn) { int offset = hemigetelementoffset(); int stride = hemigetelementstride(); for(int opt = offset; opt < optn; opt += stride)
8 { //... compute call and put value based on Black-Scholes formula Note: the hemigetelement*() functions are specialized to simple (but common) element-wise parallelism. As such, they may not be useful for arbitrary strides, data sharing, or other more complex parallelism arrangements; but they may serve as examples for creating your own. HEMI PORTABLE CONSTANTS Global constant values can be defined using the HEMI_DEFINE_CONSTANT macro, which takes a name and an initial value. When compiled with NVCC as CUDA code, this declares two versions of the constant, one constant variable for the device, and one normal host variable. When compiled with a host compiler, only the host variable is defined. For static or external linkage, use the HEMI_DEFINE_STATIC_CONSTANT andhemi_define_extern_constant versions of the macro, respectively. To access variables defined usinghemi_define_*_constant macros, use the HEMI_CONSTANT macro which automatically resolves to either the device or host constant depending on whether it is called from device or host code. This means that the proper variable will be chosen when the constant is accessed within functions declared with HEMI_DEV_CALLABLE_* andhemi_kernel macros. To explicitly access the device version of a constant, use HEMI_DEV_CONSTANT. This is useful when the constant is an argument to a CUDA API function such as cudamemcpytosymbol, as shown in the following code from the nbody_vec4 example. cudamemcpytosymbol(hemi_dev_constant(softeningsquared), &ss, sizeof(float), 0, cudamemcpyhosttodevice) HEMI PORTABLE DATA: HEMI::ARRAY One of the biggest challenges in writing portable CUDA code is memory management. Hemi provides thehemi::array C++ template class (defined in hemi/array.h), a simple data management container which allows arrays of arbitrary type to be created and used with both host and device code. hemi::array maintains a host and a device pointer for each array. It lazily transfers data between the host and device as needed when the user requests a pointer to the host or device memory. Pointer requests specify read-only, read/write, or write-only options to
9 keep the valid location of data up-to-date and only copy data when the requested pointer is invalid. hemi::array supports pinned host memory for efficient PCI-express transfers, and handles CUDA error checking internally. Here is an excerpt from the nbody_vec4 example. hemi::array<vec4f> bodies(n, true); hemi::array<vec4f> forcevectors(n, true); randomizebodies(bodies.writeonlyhostptr(), N); // Call host function defined in a.cpp compilation unit allpairsforceshost(forcevectors.writeonlyhostptr(), bodies.readonlyhostptr(), N); printf("cpu: Force vector 0: (%0.3f, %0.3f, %0.3f)\n", forcevectors.readonlyhostptr()[0].x, forcevectors.readonlyhostptr()[0].y, forcevectors.readonlyhostptr()[0].z);... // Call device function defined in a.cu compilation unit // that uses host/device shared functions and class member functions allpairsforcescuda(forcevectors.writeonlydeviceptr(), bodies.readonlydeviceptr(), N, false); printf("gpu: Force vector 0: (%0.3f, %0.3f, %0.3f)\n", forcevectors.readonlyhostptr()[0].x, forcevectors.readonlyhostptr()[0].y, forcevectors.readonlyhostptr()[0].z); Typical CUDA code requires explicit duplication of host allocations on the device, and explicit copy calls between them, along with error checking for all allocations and transfers. The blackscholes_hemiarray example demonstrates how much hemi::array simplifies CUDA C code, doing with 136 lines of code what the blackscholes example does in 180 lines. HEMI CUDA ERROR CHECKING hemi.h provides two convenience functions for checking CUDA errors. checkcuda verifies that its single argument has the value cudasuccess, and otherwise prints an error message and asserts if #DEBUG is defined. This function is typically wrapped around CUDA API calls, as in the following. checkcuda( cudamemcpy(d_price, price, OPT_SIZE, cudamemcpyhosttodevice) );
10 checkcudaerrors takes no arguments and checks the current state of the CUDA context for errors. This function synchronizes the CUDA device (cudadevicesynchronize()) to ensure asynchronous launch errors are caught. BothcheckCuda and checkcudaerrors act as No-ops when DEBUG is not defined (release builds). SUMMARY: MIX AND MATCH I designed Hemi to provide a loosely-coupled set of utilities and examples for creating reusable, portable CUDA C/C++ code. Feel free to use the parts that you need and ignore others, or modify and replace portions to suit the needs of your projects. Or just use it as an example and develop your own utilities for writing flexible and portable CUDA code. If you make changes that you feel would be generally useful, please fork the project on github, commit your changes, and submit a pull request! If you would like to give feedback about Hemi, please contact me using the contact form or by filing an issue on Github. Share functionality using Portable Class Libraries Applies to: Windows Phone 8 Windows 8 This topic explains what a Portable Class Library is and how you can use it to share code between your apps for Windows Phone 8 and Windows 8. This topic contains the following sections. What is a Portable Class Library? How to use a Portable Class Library in your app for Windows Phone 8 and Windows 8 What to share in a Portable Class Library Portable Class Libraries and MVVM Related Topics What is a Portable Class Library? Portable Class Libraries have been available since.net Framework 4. You can use them to create portable assemblies that can target multiple platforms, including Windows 7, Windows 8, Windows Phone, Silverlight, and Xbox 360, as demonstrated in the following image. Portable Class Libraries support a subset of.net assemblies that target the platforms you choose. Visual Studio 2012 Pro and greater versions come with a project template that you can use to create Portable Class Libraries. This is a very good way to reduce time and cost by sharing functionality across the platforms you want to support.
11 How to use a Portable Class Library in your app for Windows Phone 8 and Windows 8 You can use Portable Class Libraries to share functionality between your apps for Windows Phone 8 and Windows 8. Note that the Express versions of Visual Studio 2012 don t include a Portable Class Library project template. The template is available only in Visual Studio 2012 Pro or greater versions. The following diagram demonstrates how both apps share a Portable Class Library. To reference a Portable Class Library, in Solution Explorer, select your project, and then choose Add Reference. Point to either a binary of the Portable Class Library or to the Portable Class Library project. What to share in a Portable Class Library
12 When you create your app for Windows Phone 8 and Windows 8, you should identify portable code. Place this code in a Portable Class Library and share the portable library between both apps. Portable code has the following characteristics: Managed (C# or VB) code Portable Class Libraries is a.net concept and supports managed code only. Because Windows Phone 8 and Windows 8 share the same.net engine, a lot of the managed code you write, particularly app logic, has the potential to be portable. Doesn t use conditional compilation A Portable Class Library is compiled against a set of portable.net assemblies for the platforms you want to target. If you re building an app for Windows Phone 8 and Windows 8, this means a set of.net assemblies that are portable on those platforms. A conditional compilation directive is intended to enable different code paths when the code is compiled for different platforms or configurations. This isn t the purpose of Portable Class Libraries. If you need to implement functionality for Windows Phone 8 and implement it differently for Windows 8, you can t include both code paths in a Portable Class Library. Instead, you should abstract away the platform-dependent code and share only the portable, platform-independent code. Doesn t use Windows Runtime APIs Windows Runtime APIs aren t portable and can t be used in a Portable Class Library. There is overlap in the Windows Runtime APIs that are supported on Windows Phone 8 and Windows 8. However, binary compatibility is not supported. Your code has to be compiled for each platform and therefore isn t suitable for a Portable Class Library. Here too you should abstract the use of Windows Runtime APIs into classes or objects that aren t shared in a Portable Class Library. Doesn t use UI constructs Although XAML for Windows Phone 8 and Windows 8 looks the same and for the most part UI controls have the same names, this code isn t portable. Your UI code must be compiled for each platform and therefore isn t a candidate for placement in a Portable Class Library. Portable Class Libraries and MVVM When you create your app for Windows Phone 8 and Windows 8 using the Model-View-ViewModel (MVVM) pattern and using.net APIs, you have the potential to share a lot of code in a Portable Class Library. Your ViewModel and Model can be designed to be portable and you should place these in a Portable Class Library. The views of your app, and the startup code, typically are platform-specific and should be implemented in your Windows Phone 8 and Windows 8 app projects. This is illustrated in the following diagram.
13 If your ViewModel needs to call platform-specific code, you should abstract that functionality into the platform-independent interface and use the interface in the Portable Class Library. The interface can then be implemented in a platform-specific way in each app project. This is a very powerful code-sharing technique and allows binary sharing because the Portable Class Library is compiled once and then used in multiple platforms.
CUDA C Programming Mark Harris NVIDIA Corporation
CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment
More informationCUDA Kenjiro Taura 1 / 36
CUDA Kenjiro Taura 1 / 36 Contents 1 Overview 2 CUDA Basics 3 Kernels 4 Threads and thread blocks 5 Moving data between host and device 6 Data sharing among threads in the device 2 / 36 Contents 1 Overview
More informationLecture 3: Introduction to CUDA
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Introduction to CUDA Some slides here are adopted from: NVIDIA teaching kit Mohamed Zahran (aka Z) mzahran@cs.nyu.edu
More informationCSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA
CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA Sreepathi Pai October 18, 2017 URCS Outline Background Memory Code Execution Model Outline Background Memory Code Execution Model
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationGPU Computing: Introduction to CUDA. Dr Paul Richmond
GPU Computing: Introduction to CUDA Dr Paul Richmond http://paulrichmond.shef.ac.uk This lecture CUDA Programming Model CUDA Device Code CUDA Host Code and Memory Management CUDA Compilation Programming
More informationCS179 GPU Programming Recitation 4: CUDA Particles
Recitation 4: CUDA Particles Lab 4 CUDA Particle systems Two parts Simple repeat of Lab 3 Interacting Flocking simulation 2 Setup Two folders given particles_simple, particles_interact Must install NVIDIA_CUDA_SDK
More informationAn Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture
An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS
More informationModule 2: Introduction to CUDA C. Objective
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationAn Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center
An Introduction to GPU Architecture and CUDA C/C++ Programming Bin Chen April 4, 2018 Research Computing Center Outline Introduction to GPU architecture Introduction to CUDA programming model Using the
More informationGPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:
COMP528 Multi-Core Programming GPU programming,ii www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Different ways: GPU programming
More informationIntroduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research
Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers
More informationParallel Computing. Lecture 19: CUDA - I
CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationOpenACC Course. Office Hour #2 Q&A
OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle
More informationCS691/SC791: Parallel & Distributed Computing
CS691/SC791: Parallel & Distributed Computing Introduction to OpenMP 1 Contents Introduction OpenMP Programming Model and Examples OpenMP programming examples Task parallelism. Explicit thread synchronization.
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationVector Addition on the Device: main()
Vector Addition on the Device: main() #define N 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationCUDA (Compute Unified Device Architecture)
CUDA (Compute Unified Device Architecture) Mike Bailey History of GPU Performance vs. CPU Performance GFLOPS Source: NVIDIA G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce
More informationAutomatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo
Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo 31 August, 2015 Goals Running CUDA code on CPUs. Why? Performance portability! A major challenge faced
More informationModule 2: Introduction to CUDA C
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationCh. 3: The C in C++ - Continued -
Ch. 3: The C in C++ - Continued - QUIZ What are the 3 ways a reference can be passed to a C++ function? QUIZ True or false: References behave like constant pointers with automatic dereferencing. QUIZ What
More informationCUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.
Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication
More informationGPU Programming Using CUDA. Samuli Laine NVIDIA Research
GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick
More informationINTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro
INTRODUCTION TO GPU COMPUTING WITH CUDA Topi Siro 19.10.2015 OUTLINE PART I - Tue 20.10 10-12 What is GPU computing? What is CUDA? Running GPU jobs on Triton PART II - Thu 22.10 10-12 Using libraries Different
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationCUDA C/C++ BASICS. NVIDIA Corporation
CUDA C/C++ BASICS NVIDIA Corporation What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on industry-standard C/C++ Small set of extensions
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationCUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17
CUDA Lecture 2 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de December 15, 2015 CUDA Programming Fundamentals CUDA
More informationSoftware Engineering /48
Software Engineering 1 /48 Topics 1. The Compilation Process and You 2. Polymorphism and Composition 3. Small Functions 4. Comments 2 /48 The Compilation Process and You 3 / 48 1. Intro - How do you turn
More informationMODELING CUDA COMPUTE APPLICATIONS BY CRITICAL PATH. PATRIC ZHAO, JIRI KRAUS, SKY WU
MODELING CUDA COMPUTE APPLICATIONS BY CRITICAL PATH PATRIC ZHAO, JIRI KRAUS, SKY WU patricz@nvidia.com AGENDA Background Collect data and Visualizations Critical Path Performance analysis and prediction
More informationCS201 - Introduction to Programming Glossary By
CS201 - Introduction to Programming Glossary By #include : The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with
More informationLecture 2: Introduction to CUDA C
CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013 1 CUDA /OpenCL Execution Model Integrated host+device app C program Serial or
More informationECE 574 Cluster Computing Lecture 15
ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements
More informationGPU Programming Using CUDA. Samuli Laine NVIDIA Research
GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick
More informationStream Computing using Brook+
Stream Computing using Brook+ School of Electrical Engineering and Computer Science University of Central Florida Slides courtesy of P. Bhaniramka Outline Overview of Brook+ Brook+ Software Architecture
More informationDynamic Cuda with F# HPC GPU & F# Meetup. March 19. San Jose, California
Dynamic Cuda with F# HPC GPU & F# Meetup March 19 San Jose, California Dr. Daniel Egloff daniel.egloff@quantalea.net +41 44 520 01 17 +41 79 430 03 61 About Us! Software development and consulting company!
More informationCUDA Parallel Programming Model. Scalable Parallel Programming with CUDA
CUDA Parallel Programming Model Scalable Parallel Programming with CUDA Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationShort Notes of CS201
#includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system
More informationReductions and Low-Level Performance Considerations CME343 / ME May David Tarjan NVIDIA Research
Reductions and Low-Level Performance Considerations CME343 / ME339 27 May 2011 David Tarjan [dtarjan@nvidia.com] NVIDIA Research REDUCTIONS Reduction! Reduce vector to a single value! Via an associative
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationScientific discovery, analysis and prediction made possible through high performance computing.
Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013
More informationCSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591: GPU Programming Programmer Interface Klaus Mueller Computer Science Department Stony Brook University Compute Levels Encodes the hardware capability of a GPU card newer cards have higher compute
More informationCUDA Basics. July 6, 2016
Mitglied der Helmholtz-Gemeinschaft CUDA Basics July 6, 2016 CUDA Kernels Parallel portion of application: execute as a kernel Entire GPU executes kernel, many threads CUDA threads: Lightweight Fast switching
More informationSpeed Up Your Codes Using GPU
Speed Up Your Codes Using GPU Wu Di and Yeo Khoon Seng (Department of Mechanical Engineering) The use of Graphics Processing Units (GPU) for rendering is well known, but their power for general parallel
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationCUDA C/C++ BASICS. NVIDIA Corporation
CUDA C/C++ BASICS NVIDIA Corporation What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on industry-standard C/C++ Small set of extensions
More informationCUDA Parallel Programming Model Michael Garland
CUDA Parallel Programming Model Michael Garland NVIDIA Research Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of a parallel
More informationFundamental Optimizations
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationCUDA 7.5 OVERVIEW WEBINAR 7/23/15
CUDA 7.5 OVERVIEW WEBINAR 7/23/15 CUDA 7.5 https://developer.nvidia.com/cuda-toolkit 16-bit Floating-Point Storage 2x larger datasets in GPU memory Great for Deep Learning cusparse Dense Matrix * Sparse
More informationQUIZ. What are 3 differences between C and C++ const variables?
QUIZ What are 3 differences between C and C++ const variables? Solution QUIZ Source: http://stackoverflow.com/questions/17349387/scope-of-macros-in-c Solution The C/C++ preprocessor substitutes mechanically,
More informationSupercomputing, Tutorial S03 New Orleans, Nov 14, 2010
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationTechnische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics
GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth
More informationCS179 GPU Programming: CUDA Memory. Lecture originally by Luke Durant and Tamas Szalay
: CUDA Memory Lecture originally by Luke Durant and Tamas Szalay CUDA Memory Review of Memory Spaces Memory syntax Constant Memory Allocation Issues Global Memory Gotchas Shared Memory Gotchas Texture
More informationIntroduction to Scientific Programming using GPGPU and CUDA
Introduction to Scientific Programming using GPGPU and CUDA Day 1 Sergio Orlandini s.orlandini@cineca.it Mario Tacconi m.tacconi@cineca.it 0 Hands on: Compiling a CUDA program Environment and utility:
More informationMemory concept. Grid concept, Synchronization. GPU Programming. Szénási Sándor.
Memory concept Grid concept, Synchronization GPU Programming http://cuda.nik.uni-obuda.hu Szénási Sándor szenasi.sandor@nik.uni-obuda.hu GPU Education Center of Óbuda University MEMORY CONCEPT Off-chip
More informationWriting and compiling a CUDA code
Writing and compiling a CUDA code Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) Writing CUDA code 1 / 65 The CUDA language If we want fast code, we (unfortunately)
More informationPROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec
PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization
More informationBlocks, Grids, and Shared Memory
Blocks, Grids, and Shared Memory GPU Course, Fall 2012 Last week: ax+b Homework Threads, Blocks, Grids CUDA threads are organized into blocks Threads operate in SIMD(ish) manner -- each executing same
More informationSupporting Data Parallelism in Matcloud: Final Report
Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by
More informationCUDA Programming. Aiichiro Nakano
CUDA Programming Aiichiro Nakano Collaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science
More informationInformation Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY
Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Introduction to memory spaces and memory access Shared memory Matrix multiplication example Lecture
More informationIntroduction to Parallel Computing with CUDA. Oswald Haan
Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More information04 Sharing Code Between Windows 8 and Windows Phone 8 in Visual Studio. Ben Riga
04 Sharing Code Between Windows 8 and Windows Phone 8 in Visual Studio Ben Riga http://about.me/ben.riga Course Topics Building Apps for Both Windows 8 and Windows Phone 8 Jump Start 01 Comparing Windows
More informationFixed-Point Math and Other Optimizations
Fixed-Point Math and Other Optimizations Embedded Systems 8-1 Fixed Point Math Why and How Floating point is too slow and integers truncate the data Floating point subroutines: slower than native, overhead
More informationHands-on CUDA exercises
Hands-on CUDA exercises CUDA Exercises We have provided skeletons and solutions for 6 hands-on CUDA exercises In each exercise (except for #5), you have to implement the missing portions of the code Finished
More informationMassively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN
Massively Parallel Computing with CUDA Carlos Alberto Martínez Angeles Cinvestav-IPN What is a GPU? A graphics processing unit (GPU) The term GPU was popularized by Nvidia in 1999 marketed the GeForce
More informationECE/ME/EMA/CS 759 High Performance Computing for Engineering Applications
ECE/ME/EMA/CS 759 High Performance Computing for Engineering Applications The NVIDIA GPU Memory Ecosystem Atomic operations in CUDA The thrust library October 7, 2015 Dan Negrut, 2015 ECE/ME/EMA/CS 759
More informationIdentifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011
Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for
More informationGPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34
1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions
More informationHands-on CUDA Optimization. CUDA Workshop
Hands-on CUDA Optimization CUDA Workshop Exercise Today we have a progressive exercise The exercise is broken into 5 steps If you get lost you can always catch up by grabbing the corresponding directory
More informationCompiling CUDA and Other Languages for GPUs. Vinod Grover and Yuan Lin
Compiling CUDA and Other Languages for GPUs Vinod Grover and Yuan Lin Agenda Vision Compiler Architecture Scenarios SDK Components Roadmap Deep Dive SDK Samples Demos Vision Build a platform for GPU computing
More informationCS 326 Operating Systems C Programming. Greg Benson Department of Computer Science University of San Francisco
CS 326 Operating Systems C Programming Greg Benson Department of Computer Science University of San Francisco Why C? Fast (good optimizing compilers) Not too high-level (Java, Python, Lisp) Not too low-level
More informationMassively Parallel Algorithms
Massively Parallel Algorithms Introduction to CUDA & Many Fundamental Concepts of Parallel Programming G. Zachmann University of Bremen, Germany cgvr.cs.uni-bremen.de Hybrid/Heterogeneous Computation/Architecture
More informationP.G.TRB - COMPUTER SCIENCE. c) data processing language d) none of the above
P.G.TRB - COMPUTER SCIENCE Total Marks : 50 Time : 30 Minutes 1. C was primarily developed as a a)systems programming language b) general purpose language c) data processing language d) none of the above
More informationLecture 6b Introduction of CUDA programming
CS075 1896 1920 1987 2006 Lecture 6b Introduction of CUDA programming 0 1 0, What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationCUDA C/C++ Basics GTC 2012 Justin Luitjens, NVIDIA Corporation
CUDA C/C++ Basics GTC 2012 Justin Luitjens, NVIDIA Corporation What is CUDA? CUDA Platform Expose GPU computing for general purpose Retain performance CUDA C/C++ Based on industry-standard C/C++ Small
More informationUniversity of Bielefeld
Geistes-, Natur-, Sozial- und Technikwissenschaften gemeinsam unter einem Dach Introduction to GPU Programming using CUDA Olaf Kaczmarek University of Bielefeld STRONGnet Summerschool 2011 ZIF Bielefeld
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationOptimizing CUDA for GPU Architecture. CSInParallel Project
Optimizing CUDA for GPU Architecture CSInParallel Project August 13, 2014 CONTENTS 1 CUDA Architecture 2 1.1 Physical Architecture........................................... 2 1.2 Virtual Architecture...........................................
More informationSTUDY NOTES UNIT 1 - INTRODUCTION TO OBJECT ORIENTED PROGRAMMING
OBJECT ORIENTED PROGRAMMING STUDY NOTES UNIT 1 - INTRODUCTION TO OBJECT ORIENTED PROGRAMMING 1. Object Oriented Programming Paradigms 2. Comparison of Programming Paradigms 3. Basic Object Oriented Programming
More informationNumbaPro CUDA Python. Square matrix multiplication
NumbaPro Enables parallel programming in Python Support various entry points: Low-level (CUDA-C like) programming language High-level array oriented interface CUDA library bindings Also support multicore
More informationCS377P Programming for Performance GPU Programming - I
CS377P Programming for Performance GPU Programming - I Sreepathi Pai UTCS November 9, 2015 Outline 1 Introduction to CUDA 2 Basic Performance 3 Memory Performance Outline 1 Introduction to CUDA 2 Basic
More informationParallel Programming on Larrabee. Tim Foley Intel Corp
Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This
More informationLecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
More informationMemory. Lecture 2: different memory and variable types. Memory Hierarchy. CPU Memory Hierarchy. Main memory
Memory Lecture 2: different memory and variable types Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Key challenge in modern computer architecture
More informationCSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591: GPU Programming Using CUDA in Practice Klaus Mueller Computer Science Department Stony Brook University Code examples from Shane Cook CUDA Programming Related to: score boarding load and store
More informationAn Introduction to GPU Computing and CUDA Architecture
An Introduction to GPU Computing and CUDA Architecture Sarah Tariq, NVIDIA Corporation GPU Computing GPU: Graphics Processing Unit Traditionally used for real-time rendering High computational density
More informationC++ for System Developers with Design Pattern
C++ for System Developers with Design Pattern Introduction: This course introduces the C++ language for use on real time and embedded applications. The first part of the course focuses on the language
More informationCUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN
CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school
More informationGPU Programming with Ateji PX June 8 th Ateji All rights reserved.
GPU Programming with Ateji PX June 8 th 2010 Ateji All rights reserved. Goals Write once, run everywhere, even on a GPU Target heterogeneous architectures from Java GPU accelerators OpenCL standard Get
More information