Developing Portable CUDA C/C++ Code with Hemi


Software development is as much about writing code fast as it is about writing fast code, and central to rapid development is software reuse and portability. When building heterogeneous applications, developers must be able to share code between projects, platforms, compilers, and target architectures. Ideally, libraries of domain-specific code should be easily retargetable. In this post I'll talk about Hemi, a simple open-source C++ header library that simplifies writing portable CUDA C/C++ code.

In the screenshot below, both columns show a simple Black-Scholes code written to be compilable with either NVCC or a standard C++ host compiler, and also runnable on either the CPU or a CUDA GPU. The right column is written using Hemi's macros and smart heterogeneous array container class, hemi::Array. Using Hemi, the length and complexity of this code is reduced by half.

[Figure: Portable CUDA C++ code without Hemi (left) and with Hemi (right).]

CUDA C++ and the NVIDIA NVCC compiler tool chain provide a number of features designed to make it easier to write portable code, including language-level integration of host and device code and data, declaration specifiers (e.g. __host__ and __device__) and preprocessor definitions (e.g. __CUDACC__). Together, these features enable developers to write code that can be compiled and run on either the host, the device, or both. But as the left column above shows, using them directly can result in complicated code. One cause of this is the code duplication that is required to support multiple target platforms, and another is the verbose memory management incurred by heterogeneous memory spaces. Hemi aims to tackle both problems.

Hemi is inspired by real-world CUDA software projects like PhysX and OptiX, which use custom libraries of preprocessor macros and container classes that enable the definition of portable application-specific libraries, classes, and kernels. PhysX, for example, has a comprehensive 3D vector math library that is portable across multiple platforms, including CUDA GPUs, Intel and other CPUs, and game consoles. To make CUDA memory management and transfers robust and simple to implement, PhysX uses a smart generic array class that automatically copies data between the device and host only when necessary. The result is much like the right-hand side of the screenshot above, with a minimum of memory management code and no explicit memory copies.

In this post I'll describe Hemi in depth, but first I want to cover the CUDA C/C++ language and compiler features on which Hemi is built.

CUDA C++ LANGUAGE INTEGRATION AND PORTABILITY FEATURES

Host / Device Functions

If you are already programming in CUDA C/C++ then you are familiar with __device__, the declaration specifier that indicates a function that is callable from other device functions and from kernel (__global__) functions. CUDA also provides the __host__ declaration specifier for host (CPU) functions, which is the default in the absence of a specifier. Often we need to execute exactly the same code on the CPU and GPU, which means writing functions that are callable from either the host or the device. In that case, __host__ and __device__ can be combined, as shown in the following inline function that averages two floats.

    __host__ __device__ inline float avgf(float x, float y) { return (x+y)/2.0f; }

When NVCC sees this function, it generates two versions of the code, one for the host and one for the device. Any calls to the function from device code will execute the device version, and any calls from host code will execute the host version. This __host__ __device__ combination is very powerful because it enables large utility code bases to be used across heterogeneous applications, minimizing the work required to port applications. However, other compilers don't recognize these declaration specifiers, so to write truly portable code, we need to use the C preprocessor.

CUDA Preprocessor Definitions

At compile time NVCC defines several macros that can be used to selectively enable and disable code based on whether it is being compiled by NVCC, whether it is device code or host code, and which architecture version (also called compute capability) it is being compiled for.

    __NVCC__: Can be used in C/C++/CUDA source files to test whether they are currently being compiled by nvcc.

    __CUDACC__: Can be used in source files to test whether they are being treated as CUDA source files by nvcc.

    __CUDA_ARCH__: This architecture identification macro is assigned a three-digit value string xy0 (ending in a literal 0) when compiling device code for compute_xy. For example, when compiling device code for compute_20 (or sm_20), __CUDA_ARCH__ will be defined by nvcc to the value 200. This macro can be used in the implementation of device and kernel functions to determine the virtual architecture for which the code is currently being compiled. Host code must not depend on the value of this macro. Note, however, that it is not defined at all when host code is being compiled, so testing whether it is defined is a way to detect compilation of device code.

The following example combines declaration specifiers and preprocessor macros to write a portable routine for counting the number of bits that are set in a 32-bit word.

    #ifdef __CUDACC__
    __host__ __device__
    #endif
    int countSetBits(unsigned int a)
    {
    #if defined(__CUDA_ARCH__)
        return __popc(a);
    #else
        // Source: Sean Eron Anderson's Bit Twiddling Hacks (counting bits set, in parallel)
        a = a - ((a >> 1) & 0x55555555);
        a = (a & 0x33333333) + ((a >> 2) & 0x33333333);
        return ((a + (a >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
    #endif
    }

Here I have defined a function countSetBits that is callable from either host or device code and, due to the check for __CUDACC__ wrapping __host__ __device__, is compilable using NVCC or other C/C++ compilers. On the host, whether or not it is compiled with NVCC, it uses arithmetic to count the 1 bits. On the device, it uses CUDA's built-in __popc() intrinsic. If you look in CUDA's device_functions.h header file, you'll see that the value of __CUDA_ARCH__ is used to further differentiate; on Fermi and later GPUs (sm_20, __CUDA_ARCH__ >= 200) __popc() generates a single hardware population count instruction, while on earlier architectures it uses code similar to the host code.

HEMI: EASIER PORTABLE CODE

As you can see, CUDA makes writing portable code feasible and flexible, but doing so is not particularly simple. Hemi, available on GitHub, provides just two simple header files (and a few examples) that make the task much easier, with much clearer code.

The hemi.h header provides simple macros that are useful for reusing code between CUDA C/C++ and C/C++ written for other platforms (e.g. CPUs). The macros are used to decorate function prototypes and variable declarations so that they can be compiled by either NVCC or a host compiler (for example gcc or cl.exe, the MS Visual Studio compiler). The macros can be used in .cu, .cuh, .cpp, .h, and .inl files to define code that can be compiled either for the host or the device.
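To make the idea concrete, here is a minimal sketch of how decoration macros of this kind can be defined on top of __CUDACC__ and __CUDA_ARCH__. The names SKETCH_DEV_CALLABLE_INLINE, SKETCH_DEV_CODE, lerpf, and compiledForDevice are hypothetical and chosen only for illustration; they are not copied from hemi.h, whose actual definitions may differ.

```cpp
// Hypothetical macros for illustration only; not copied from hemi.h.
#ifdef __CUDACC__
  // Treated as CUDA source by NVCC: emit a host and a device version.
  #define SKETCH_DEV_CALLABLE_INLINE __host__ __device__ inline
#else
  // Plain host compiler: the CUDA specifiers must disappear.
  #define SKETCH_DEV_CALLABLE_INLINE inline
#endif

#ifdef __CUDA_ARCH__
  // Defined only while NVCC compiles the device version of a function.
  #define SKETCH_DEV_CODE
#endif

// Linear interpolation between a and b; identical source for CPU and GPU.
SKETCH_DEV_CALLABLE_INLINE float lerpf(float a, float b, float t)
{
    return a + t * (b - a);
}

// Reports whether the version being executed was compiled as device code.
SKETCH_DEV_CALLABLE_INLINE bool compiledForDevice()
{
#ifdef SKETCH_DEV_CODE
    return true;
#else
    return false;
#endif
}
```

Built with g++ or cl.exe, both functions are ordinary inline functions and compiledForDevice() returns false. Built as CUDA source, the same file should yield host and device versions of each function, with the device version of compiledForDevice() returning true.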
Before diving into the features of Hemi, let me draw your attention to the Hemi examples.

blackscholes: This is a simple example that performs a Black-Scholes options pricing calculation using code that is entirely shared between host code compiled with any C/C++ compiler (including NVCC) and device code that is compiled with NVCC. When compiled with nvcc -x cu (to force CUDA compilation on the .cpp file), this runs on the GPU. When compiled with nvcc or g++ it runs on the host.

blackscholes_nohemi: Just like the above, except it doesn't use Hemi. This is just to demonstrate the complexity that Hemi eliminates.

blackscholes_hostdevice: This example demonstrates how to write portable code that can be compiled to run the same code on both the host and device, in a single compile and run. This increase in run-time flexibility has a slight complexity cost, but all of the core computational code is reused.

blackscholes_hemiarray: This example is the same as the blackscholes example, except that it uses hemi::Array to encapsulate CUDA-specific memory management code and eliminate all explicit host-device memory copy code.

nbody_vec4: This example brings all of Hemi's features together. It implements a simple all-pairs n-body gravitational force calculation using a 4D vector class called Vec4f, which uses Hemi macros to enable all of the code for the class to be shared between host code compiled by the host compiler and device or host code compiled with NVCC. nbody_vec4 also shares most of the all-pairs gravitational force calculation code between device and host, and demonstrates how optimized device implementations (e.g. using shared memory) can be substituted as needed. Finally, this sample also uses hemi::Array to simplify memory management and data transfers.

HEMI PORTABLE FUNCTIONS

A typical use for host-device code sharing is commonly used utility functions. For example, here is a portable version of our earlier example function that averages two floats.

    HEMI_DEV_CALLABLE_INLINE float avgf(float x, float y) { return (x+y)/2.0f; }

This function can be called either from host code or device code, and can be compiled by either the host compiler or NVCC. The macro definition ensures that when compiled by NVCC, both a host and a device version of the function are generated, and that a normal inline function is generated when compiled by the host compiler. For another example use, see the CND() function defined in the blackscholes example included with Hemi, as well as several other functions used in the examples.
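As a further illustration of a utility function shared between host and device, the sketch below implements a cumulative normal distribution in the spirit of the CND() function mentioned above. It is not copied from the Hemi examples. The macro SKETCH_DEV_CALLABLE_INLINE is a hypothetical stand-in for HEMI_DEV_CALLABLE_INLINE so that the block builds without hemi.h, the name cndSketch is made up, and the polynomial is the standard Abramowitz and Stegun approximation (formula 26.2.17), which is an assumption on my part about a reasonable implementation.

```cpp
#include <math.h>

// Hypothetical stand-in for HEMI_DEV_CALLABLE_INLINE (assumption).
#ifdef __CUDACC__
  #define SKETCH_DEV_CALLABLE_INLINE __host__ __device__ inline
#else
  #define SKETCH_DEV_CALLABLE_INLINE inline
#endif

// Cumulative normal distribution via a fifth-order polynomial approximation.
SKETCH_DEV_CALLABLE_INLINE float cndSketch(float d)
{
    const float A1 = 0.31938153f;
    const float A2 = -0.356563782f;
    const float A3 = 1.781477937f;
    const float A4 = -1.821255978f;
    const float A5 = 1.330274429f;
    const float RSQRT2PI = 0.39894228040143267794f; // 1 / sqrt(2 * pi)

    float K = 1.0f / (1.0f + 0.2316419f * fabsf(d));

    // tail approximates the probability mass beyond |d|.
    float tail = RSQRT2PI * expf(-0.5f * d * d) *
                 (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))));

    return (d > 0.0f) ? 1.0f - tail : tail;
}
```

Because the body uses only arithmetic and math functions available on both sides (fabsf and expf), no HEMI_DEV_CODE-style branching is needed, and the same source can be called from a host loop or from a kernel.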

HEMI PORTABLE CLASSES

The HEMI_DEV_CALLABLE_MEMBER and HEMI_DEV_CALLABLE_INLINE_MEMBER macros can be used to create classes that are reusable between host and device code, by decorating any member function prototype that will be used by both device and host code. Here is an example excerpt of a portable class (a 4D vector type used in the nbody_vec4 example).

    struct HEMI_ALIGN(16) Vec4f
    {
        float x, y, z, w;

        HEMI_DEV_CALLABLE_INLINE_MEMBER
        Vec4f() {}

        HEMI_DEV_CALLABLE_INLINE_MEMBER
        Vec4f(float xx, float yy, float zz, float ww) : x(xx), y(yy), z(zz), w(ww) {}

        HEMI_DEV_CALLABLE_INLINE_MEMBER
        Vec4f(const Vec4f& v) : x(v.x), y(v.y), z(v.z), w(v.w) {}

        HEMI_DEV_CALLABLE_INLINE_MEMBER
        Vec4f& operator=(const Vec4f& v) {
            x = v.x; y = v.y; z = v.z; w = v.w;
            return *this;
        }

        HEMI_DEV_CALLABLE_INLINE_MEMBER
        Vec4f operator+(const Vec4f& v) const {
            return Vec4f(x+v.x, y+v.y, z+v.z, w+v.w);
        }
        ...
    };

The HEMI_ALIGN macro is used on types that will be passed in arrays or pointers as arguments to CUDA device kernel functions, to ensure proper alignment. HEMI_ALIGN generates correct alignment specifiers for host compilers, too. For details on alignment, see the NVIDIA CUDA C Programming Guide (Section 5.3 in v5.0).

NOTE: DEVICE-SPECIFIC CODE

Code in functions declared with HEMI_DEV_CALLABLE_* must be portable. In other words, it must compile and run correctly for both the host and the device. If it does not, you can use HEMI_DEV_CODE (which reduces to __CUDA_ARCH__) within the function to define separate code for host and device, as in the following example.

    HEMI_DEV_CALLABLE_INLINE_MEMBER
    float inverseLength(float softening = 0.0f) const
    {
    #ifdef HEMI_DEV_CODE
        return rsqrtf(lengthSqr() + softening); // use fast GPU intrinsic
    #else
        return 1.0f / sqrtf(lengthSqr() + softening);
    #endif
    }

If you need to write a function only for the device, use the CUDA C __device__ specifier directly.

Note: Non-inline functions and methods

Take care when using the non-inline versions of the declaration specifier macros (HEMI_DEV_CALLABLE and HEMI_DEV_CALLABLE_MEMBER) to avoid multiple-definition linker errors caused by using them in headers that are included into multiple compilation units. The best way to use HEMI_DEV_CALLABLE is to declare functions using this macro in a header, define their implementation in a .cu file, and compile that file with NVCC. This will generate code for both host and device. The host code will be linked into your library or application and will be callable from other host code compilation units (.c and .cpp files). Likewise, for HEMI_DEV_CALLABLE_MEMBER, put the class and function declarations in a header, and the member function implementations in a .cu file compiled by NVCC.

HEMI PORTABLE KERNELS

Use HEMI_KERNEL to declare functions that are launchable as CUDA kernels when compiled with NVCC, or callable as C/C++ (host) functions when compiled with the host compiler. HEMI_KERNEL_LAUNCH is a convenience macro that launches a kernel function on the device when compiled with NVCC, or calls the host function when compiled with the host compiler.
For example, here is an excerpt from the blackscholes example, which is a single .cpp file that can be either compiled with NVCC to run on the GPU, or compiled with the host compiler to run on the CPU.

    // Black-Scholes formula for both call and put
    HEMI_KERNEL(BlackScholes)
        (float *callResult, float *putResult, const float *stockPrice,
         const float *optionStrike, const float *optionYears, float Riskfree,
         float Volatility, int optN)
    {
        ...
    }

    // ... in main() ...
    HEMI_KERNEL_LAUNCH(BlackScholes, gridDim, blockDim, 0, 0,
                       d_callResult, d_putResult, d_stockPrice, d_optionStrike,
                       d_optionYears, RISKFREE, VOLATILITY, OPT_N);

HEMI_KERNEL_LAUNCH requires grid and block dimensions to be passed to it, but these parameters are ignored when compiled for the host. When DEBUG is defined, HEMI_KERNEL_LAUNCH checks for CUDA launch and run-time errors. You can use HEMI_KERNEL_NAME to access the generated name of the kernel function, for example to pass a function pointer to CUDA API functions like cudaFuncGetAttributes().

Iteration

For kernel functions with simple independent element-wise parallelism, Hemi provides two functions to enable iterating over elements sequentially in host code or in parallel in device code.

    hemiGetElementOffset() returns the offset of the current thread within the 1D grid, or zero in host code. In device code, it resolves to blockDim.x * blockIdx.x + threadIdx.x.

    hemiGetElementStride() returns the size of the 1D grid in threads, or one in host code. In device code, it resolves to gridDim.x * blockDim.x.

The blackscholes example demonstrates iteration in the following function, which can be compiled and run as a sequential function on the host or as a CUDA kernel on the device.

    // Black-Scholes formula for both call and put
    HEMI_KERNEL(BlackScholes)
        (float *callResult, float *putResult, const float *stockPrice,
         const float *optionStrike, const float *optionYears, float Riskfree,
         float Volatility, int optN)
    {
        int offset = hemiGetElementOffset();
        int stride = hemiGetElementStride();

        for (int opt = offset; opt < optN; opt += stride)
        {
            // ... compute call and put value based on Black-Scholes formula
        }
    }

Note: the hemiGetElement*() functions are specialized to simple (but common) element-wise parallelism. As such, they may not be useful for arbitrary strides, data sharing, or other more complex parallelism arrangements, but they may serve as examples for creating your own.

HEMI PORTABLE CONSTANTS

Global constant values can be defined using the HEMI_DEFINE_CONSTANT macro, which takes a name and an initial value. When compiled with NVCC as CUDA code, this declares two versions of the constant: one __constant__ variable for the device, and one normal host variable. When compiled with a host compiler, only the host variable is defined. For static or external linkage, use the HEMI_DEFINE_STATIC_CONSTANT and HEMI_DEFINE_EXTERN_CONSTANT versions of the macro, respectively.

To access variables defined using the HEMI_DEFINE_*_CONSTANT macros, use the HEMI_CONSTANT macro, which automatically resolves to either the device or the host constant depending on whether it is called from device or host code. This means that the proper variable will be chosen when the constant is accessed within functions declared with the HEMI_DEV_CALLABLE_* and HEMI_KERNEL macros.

To explicitly access the device version of a constant, use HEMI_DEV_CONSTANT. This is useful when the constant is an argument to a CUDA API function such as cudaMemcpyToSymbol, as shown in the following code from the nbody_vec4 example.

    cudaMemcpyToSymbol(HEMI_DEV_CONSTANT(softeningSquared), &ss, sizeof(float), 0, cudaMemcpyHostToDevice);

HEMI PORTABLE DATA: HEMI::ARRAY

One of the biggest challenges in writing portable CUDA code is memory management. Hemi provides the hemi::Array C++ template class (defined in hemi/array.h), a simple data management container that allows arrays of arbitrary type to be created and used with both host and device code. hemi::Array maintains a host and a device pointer for each array.
It lazily transfers data between the host and device as needed when the user requests a pointer to the host or device memory. Pointer requests specify read-only, read/write, or write-only options to keep the valid location of the data up to date, so that data is copied only when the requested pointer is invalid. hemi::Array supports pinned host memory for efficient PCI-Express transfers, and handles CUDA error checking internally. Here is an excerpt from the nbody_vec4 example.

    hemi::Array<Vec4f> bodies(N, true);
    hemi::Array<Vec4f> forceVectors(N, true);

    randomizeBodies(bodies.writeOnlyHostPtr(), N);

    // Call host function defined in a .cpp compilation unit
    allPairsForcesHost(forceVectors.writeOnlyHostPtr(), bodies.readOnlyHostPtr(), N);

    printf("cpu: Force vector 0: (%0.3f, %0.3f, %0.3f)\n",
           forceVectors.readOnlyHostPtr()[0].x,
           forceVectors.readOnlyHostPtr()[0].y,
           forceVectors.readOnlyHostPtr()[0].z);

    ...

    // Call device function defined in a .cu compilation unit
    // that uses host/device shared functions and class member functions
    allPairsForcesCuda(forceVectors.writeOnlyDevicePtr(), bodies.readOnlyDevicePtr(), N, false);

    printf("gpu: Force vector 0: (%0.3f, %0.3f, %0.3f)\n",
           forceVectors.readOnlyHostPtr()[0].x,
           forceVectors.readOnlyHostPtr()[0].y,
           forceVectors.readOnlyHostPtr()[0].z);

Typical CUDA code requires explicit duplication of host allocations on the device, and explicit copy calls between them, along with error checking for all allocations and transfers. The blackscholes_hemiarray example demonstrates how much hemi::Array simplifies CUDA C code, doing with 136 lines of code what the blackscholes example does in 180 lines.

HEMI CUDA ERROR CHECKING

hemi.h provides two convenience functions for checking CUDA errors. checkCuda verifies that its single argument has the value cudaSuccess, and otherwise prints an error message and asserts if DEBUG is defined. This function is typically wrapped around CUDA API calls, as in the following.

    checkCuda( cudaMemcpy(d_price, price, OPT_SIZE, cudaMemcpyHostToDevice) );
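A pass-through wrapper of this kind can be sketched in a few lines. This is an illustration of the pattern under stated assumptions, not Hemi's implementation: the names SketchStatus, kSketchOk, and checkSketch are hypothetical, and when the file is not compiled as CUDA source an int stands in for cudaError_t so that the block builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstdio>

#ifdef __CUDACC__
  #include <cuda_runtime.h>
  typedef cudaError_t SketchStatus;             // real CUDA status type
  static const SketchStatus kSketchOk = cudaSuccess;
#else
  typedef int SketchStatus;                     // host-only stand-in (assumption)
  static const SketchStatus kSketchOk = 0;
#endif

// Returns its argument unchanged so it can wrap any call that yields a status.
// The check itself is compiled in only when DEBUG is defined.
inline SketchStatus checkSketch(SketchStatus result)
{
#ifdef DEBUG
    if (result != kSketchOk) {
        std::fprintf(stderr, "Runtime error: status code %d\n",
                     static_cast<int>(result));
        assert(result == kSketchOk);
    }
#endif
    return result;
}
```

Returning the status rather than void keeps the wrapper transparent: the caller can still inspect the value, and in a build without DEBUG the function reduces to a pass-through that the compiler can inline away.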

checkCudaErrors takes no arguments and checks the current state of the CUDA context for errors. This function synchronizes the CUDA device (cudaDeviceSynchronize()) to ensure asynchronous launch errors are caught.

Both checkCuda and checkCudaErrors act as no-ops when DEBUG is not defined (release builds).

SUMMARY: MIX AND MATCH

I designed Hemi to provide a loosely coupled set of utilities and examples for creating reusable, portable CUDA C/C++ code. Feel free to use the parts that you need and ignore others, or modify and replace portions to suit the needs of your projects. Or just use it as an example and develop your own utilities for writing flexible and portable CUDA code. If you make changes that you feel would be generally useful, please fork the project on GitHub, commit your changes, and submit a pull request! If you would like to give feedback about Hemi, please contact me using the contact form or by filing an issue on GitHub.

Share functionality using Portable Class Libraries

Applies to: Windows Phone 8, Windows 8

This topic explains what a Portable Class Library is and how you can use it to share code between your apps for Windows Phone 8 and Windows 8. This topic contains the following sections.

    What is a Portable Class Library?
    How to use a Portable Class Library in your app for Windows Phone 8 and Windows 8
    What to share in a Portable Class Library
    Portable Class Libraries and MVVM
    Related Topics

What is a Portable Class Library?

Portable Class Libraries have been available since .NET Framework 4. You can use them to create portable assemblies that can target multiple platforms, including Windows 7, Windows 8, Windows Phone, Silverlight, and Xbox 360, as demonstrated in the following image. Portable Class Libraries support a subset of .NET assemblies that target the platforms you choose. Visual Studio 2012 Pro and greater versions come with a project template that you can use to create Portable Class Libraries. This is a very good way to reduce time and cost by sharing functionality across the platforms you want to support.

How to use a Portable Class Library in your app for Windows Phone 8 and Windows 8

You can use Portable Class Libraries to share functionality between your apps for Windows Phone 8 and Windows 8. Note that the Express versions of Visual Studio 2012 don't include a Portable Class Library project template. The template is available only in Visual Studio 2012 Pro or greater versions. The following diagram demonstrates how both apps share a Portable Class Library.

To reference a Portable Class Library, in Solution Explorer, select your project, and then choose Add Reference. Point to either a binary of the Portable Class Library or to the Portable Class Library project.

What to share in a Portable Class Library

When you create your app for Windows Phone 8 and Windows 8, you should identify portable code. Place this code in a Portable Class Library and share the portable library between both apps. Portable code has the following characteristics:

Managed (C# or VB) code
Portable Class Libraries are a .NET Framework concept and support managed code only. Because Windows Phone 8 and Windows 8 share the .NET Framework engine, a lot of the managed code you write, particularly app logic, has the potential to be portable.

Doesn't use conditional compilation
A Portable Class Library is compiled against a set of assemblies for the platforms you want to target. If you're building an app for Windows Phone 8 and Windows 8, this means a set of .NET assemblies that are portable on those platforms. A conditional compilation directive is intended to enable different code paths when the code is compiled for different platforms or configurations. This isn't the purpose of Portable Class Libraries. If you need to implement functionality for Windows Phone 8 and implement it differently for Windows 8, you can't include both code paths in a Portable Class Library. Instead, you should abstract away the platform-dependent code and share only the portable, platform-independent code.

Doesn't use Windows Runtime APIs
Windows Runtime APIs aren't portable and can't be used in a Portable Class Library. There is overlap in the Windows Runtime APIs that are supported on Windows Phone 8 and Windows 8. However, binary compatibility is not supported. Your code has to be compiled for each platform and therefore isn't suitable for a Portable Class Library. Here too you should abstract the use of Windows Runtime APIs into classes or objects that aren't shared in a Portable Class Library.

Doesn't use UI constructs
Although XAML for Windows Phone 8 and Windows 8 looks the same and for the most part UI controls have the same names, this code isn't portable. Your UI code must be compiled for each platform and therefore isn't a candidate for placement in a Portable Class Library.

Portable Class Libraries and MVVM

When you create your app for Windows Phone 8 and Windows 8 using the Model-View-ViewModel (MVVM) pattern and .NET APIs, you have the potential to share a lot of code in a Portable Class Library. Your ViewModel and Model can be designed to be portable and you should place these in a Portable Class Library. The views of your app, and the startup code, typically are platform-specific and should be implemented in your Windows Phone 8 and Windows 8 app projects. This is illustrated in the following diagram.

If your ViewModel needs to call platform-specific code, you should abstract that functionality into a platform-independent interface and use the interface in the Portable Class Library. The interface can then be implemented in a platform-specific way in each app project. This is a very powerful code-sharing technique and allows binary sharing, because the Portable Class Library is compiled once and then used on multiple platforms.


More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information


CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information


CUDA C/C++ BASICS. NVIDIA Corporation CUDA C/C++ BASICS NVIDIA Corporation What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on industry-standard C/C++ Small set of extensions

More information

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA

More information

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 CUDA Lecture 2 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 December 15, 2015 CUDA Programming Fundamentals CUDA

More information

Software Engineering /48

Software Engineering /48 Software Engineering 1 /48 Topics 1. The Compilation Process and You 2. Polymorphism and Composition 3. Small Functions 4. Comments 2 /48 The Compilation Process and You 3 / 48 1. Intro - How do you turn

More information



More information
