Compiling CUDA and Other Languages for GPUs. Vinod Grover and Yuan Lin

1 Compiling CUDA and Other Languages for GPUs Vinod Grover and Yuan Lin

2 Agenda Vision Compiler Architecture Scenarios SDK Components Roadmap Deep Dive SDK Samples Demos

3 Vision
Build a platform for GPU computing around the foundations of CUDA: bring other languages to GPUs and enable CUDA for other platforms. Make that platform available for ISVs, researchers, and hobbyists, and create a flourishing ecosystem.
(Diagram: CUDA C/C++ and new languages feed the LLVM compiler for CUDA, which targets NVIDIA GPUs, x86 CPUs, and new processors.)

4 Compiler Architecture

5 CUDA Compiler Architecture (4.1, 4.2 and 5.0)
Inside NVCC: .cu -> CUDA FE -> LLVM Optimizer -> NVPTX CodeGen -> PTX -> PTXAS, with host code handled by the host compiler; the resulting application runs on the CUDA Runtime and CUDA Driver.

6 Open Compiler Architecture
Inside NVCC: .cu -> CUDA FE (libcuda.lang) -> NVVM IR -> LLVM Optimizer -> NVPTX CodeGen (libnvvm); the NVPTX CodeGen is open sourced. The generated PTX goes through PTXAS, host code through the host compiler, and the application runs on the CUDA Runtime and CUDA Driver.

7 Scenarios

8 Building Production Quality Compilers
NVCC (via libcuda.lang), CUDA Fortran, and OpenACC compilers all build on libnvvm and the CUDA Runtime.

9 Building Domain Specific Languages
A DSL front end builds on libnvvm (alongside NVCC and libcuda.lang), with a DSL runtime layered on the CUDA Runtime.

10 Enabling Other Platforms
The NVCC/libcuda.lang front end can feed an x86 LLVM backend instead of libnvvm, with an x86 CUDA runtime standing in for the GPU CUDA Runtime.

11 Enabling Research in GPU Computing
For example, a CU++ front end built on Clang can target either an x86 LLVM backend or libnvvm, together with a custom runtime.

12 Compiler SDK Components
Open source NVPTX backend, NVVM IR specification, libnvvm, and libcuda.lang.

13 Enabling open source GPU compilers
Contribute the NVPTX code generator sources back to LLVM trunk, so the sources are available directly from LLVM under the standard LLVM license. The community is invited to use it and to contribute improvements to it.

14 NVVM IR
An intermediate representation based on LLVM and targeted at GPU computing. The IR specification describes address spaces, kernels and device functions, intrinsics, and more.

15 libnvvm
A binary component library for Windows, Linux and Mac OS targeting PTX 3.1 on Fermi, Kepler and later architectures, for 32-bit and 64-bit hosts. Includes an NVVM IR verifier, an API document, and sample applications.

16 libcuda.lang
A binary library that takes a CUDA source program and generates NVVM IR and a host C++ program for the supported platforms. Includes an API document and sample applications.

17 Roadmap

18 Availability
Open source NVPTX backend: May 2012. NVVM IR spec, libnvvm, and samples: preview release in May 2012, production release in the next release after CUDA 5.0. libcuda.lang: to be announced.

19 Deep Dive

20 Deep Dive NVVM IR libnvvm SDK Samples

21 NVVM IR
Designed to represent GPU kernel functions and device functions; it represents the code executed by each CUDA thread. See the NVVM IR Specification 1.0.
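To ground what "the code executed by each CUDA thread" means at the source level, here is a minimal CUDA kernel (invented for this write-up, not from the slides): the NVVM IR generated for it describes the body run by one thread, with the thread and block indices obtained from special registers rather than from an explicit loop over elements.

    // Illustrative only: the NVVM IR for this kernel represents the work of a
    // single thread; the grid of threads is created at launch time.
    __global__ void scale(float *out, const float *in, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n)
            out[i] = alpha * in[i];
    }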

22 NVVM IR and LLVM IR
NVVM IR is based on LLVM IR, with a set of rules and intrinsics: no new types, no new operators, no new reserved words. An NVVM IR program can work with any standard LLVM IR tool (llvm-as, llvm-link, llvm-extract, llvm-dis, llvm-ar) and can be built with the standard LLVM distribution (svn co llvm).

23 NVVM IR

24 NVVM IR: Address Spaces

25 NVVM IR: Address Spaces
CUDA C/C++: the address space is a storage qualifier; a pointer is a generic pointer, which can point to any address space.

    __device__   int g;   // global memory
    __shared__   int s;   // shared memory
    __constant__ int c;   // constant memory

    __device__ void foo(int a) {
        int l;            // local (per-thread)
        int *p;           // generic pointer
        switch (a) {
        case 1: p = &g; break;
        case 2: p = &s; break;
        case 3: p = &c; break;
        case 4: p = &l; break;
        }
    }

26 NVVM IR: Address Spaces
CUDA C/C++: the address space is a storage qualifier; a pointer is a generic pointer, which can point to any address space.
OpenCL C: the address space is part of the type system; a pointer type must be qualified with an address space.

    global   int g;
    constant int c;

    void foo(local int *ps) {
        int l;
        int *p;                      // private pointer by default
        p = &l;
        global   int *pg = &g;
        constant int *pc = &c;
    }

27 NVVM IR: Address Spaces
CUDA C/C++: the address space is a storage qualifier; a pointer is a generic pointer, which can point to any address space.
OpenCL C: the address space is part of the type system; a pointer type must be qualified with an address space.
NVVM IR: supports both models in the same program.

28 NVVM IR: Address Spaces
NVVM IR defines the address space numbers, allows both generic pointers and specific (address-space-qualified) pointers, and provides intrinsics to perform conversions between generic and specific pointers.
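To make the generic-versus-specific distinction concrete, here is a small CUDA C++ sketch (invented for this write-up, not taken from the slides): each variable lives in a different address space, and wherever a specific pointer is handed to the generic-pointer parameter, the compiler emitting NVVM IR inserts one of the conversion intrinsics described above at the call boundary.

    // Hedged illustration: specific -> generic pointer conversions in CUDA C++.
    // The function and variable names are made up for this example.
    __device__ int g_val = 7;                    // global address space

    __device__ int read_generic(const int *p)    // takes a generic pointer
    {
        return *p;
    }

    __global__ void demo_kernel(int *out)
    {
        __shared__ int s_val;                    // shared address space
        int l_val = threadIdx.x;                 // per-thread local variable

        if (threadIdx.x == 0) s_val = 42;
        __syncthreads();

        // Each call passes a pointer from a different address space; the
        // compiler converts it to a generic pointer before the call.
        out[threadIdx.x] = read_generic(&g_val)
                         + read_generic(&s_val)
                         + read_generic(&l_val);
    }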

29 NVVM IR: GPU Program Properties
Properties such as the maximum/minimum expected CTA size of any launch, the minimum number of CTAs on an SM, kernel function vs. device function, texture/surface variables, and more are expressed using named metadata.
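At the CUDA C++ level, the launch-size properties above are what __launch_bounds__ expresses; the following is a minimal hedged sketch (the kernel name and the bound values are arbitrary) of the source annotation whose information the front end records alongside the kernel as named metadata.

    // Hedged CUDA sketch: a kernel annotated with launch bounds.
    __global__ void
    __launch_bounds__(256 /* max threads per CTA */, 4 /* min CTAs per SM */)
    annotated_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = 2.0f * data[i];
    }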

30 NVVM IR: Intrinsics
Intrinsics for atomic operations, barriers, address space conversions, special register reads, texture/surface access, and more.
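The usual CUDA-level sources of these intrinsics are shown in the hedged sketch below (the kernel is invented for illustration): threadIdx/blockIdx reads, __syncthreads(), and atomicAdd() are lowered by the front end to NVVM special-register, barrier, and atomic intrinsics respectively.

    // Hedged CUDA sketch: source constructs that map onto NVVM intrinsics.
    __global__ void count_positives(const int *in, int *block_counts, int n)
    {
        __shared__ int local_count;

        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // special register reads
        if (threadIdx.x == 0) local_count = 0;
        __syncthreads();                                  // barrier

        if (tid < n && in[tid] > 0)
            atomicAdd(&local_count, 1);                   // atomic operation
        __syncthreads();

        if (threadIdx.x == 0)
            block_counts[blockIdx.x] = local_count;
    }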

31 NVVM IR: NVVM ABI for PTX
Allows interoperability of PTX code generated by different NVVM compilers by specifying how NVVM linkage types are mapped to PTX linkage types, how NVVM data types are mapped to PTX data types, and how function calls are lowered to PTX calls.

32 libnvvm Library
libnvvm: an optimizing compiler library (NVVM IR -> PTX). It provides address space access optimization, thread convergence analyses, re-materialization, load/store coalescing, sign extension elimination, phi elimination, inline asm (PTX) support, and tuning of existing optimizations.

33 libnvvm Library
libnvvm: an optimizing compiler library (NVVM IR -> PTX) that can sit underneath the CUDA C compiler, DSL compilers, tools, or your own compiler.

34 libnvvm APIs
Start and shutdown; create a compilation unit; add NVVM IR modules to a compilation unit (with support for NVVM IR level linking); verify the input IR; compile to PTX; get the result (the PTX string and the message log).

35 Samples

36 SDK Sample 1: Simple
Read in an NVVM IR program in LL or BC format from disk, generate PTX, and execute the PTX on the GPU.

    // Read in
    fread(source, size, 1, fh);

    // Compile
    nvvmInit();
    nvvmCreateCU(&cu);
    nvvmCUAddModule(cu, source, size);
    nvvmCompileCU(cu, num_options, options);
    nvvmGetCompiledResultSize(cu, &ptxSize);
    nvvmGetCompiledResult(cu, ptx);
    nvvmDestroyCU(&cu);
    nvvmFini();

    // Execute ...
    cuModuleLoadDataEx(phModule, ptx, 0, 0, 0);
    cuModuleGetFunction(phKernel, *phModule, "simple");
    cuLaunchGrid(hKernel, nBlocks, 1);
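For readers filling in the elided "Execute" step, the sketch below expands it with the legacy CUDA driver API calls that match cuLaunchGrid on the slide. It is an illustration only, not the sample's actual code: the context setup, the single device-pointer argument, and all sizes are assumptions made here, error checking is omitted, and on current toolkits cuLaunchKernel replaces the cuFuncSetBlockShape/cuParamSet*/cuLaunchGrid sequence.

    // Hedged sketch of JIT-ing and launching the generated PTX; 'ptx' is the
    // string returned by nvvmGetCompiledResult above.
    CUdevice    dev;
    CUcontext   ctx;
    CUmodule    hModule;
    CUfunction  hKernel;
    CUdeviceptr dData;
    int nBlocks = 32, blockSize = 128;           // arbitrary launch shape

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoadDataEx(&hModule, ptx, 0, 0, 0);  // JIT the PTX string
    cuModuleGetFunction(&hKernel, hModule, "simple");

    cuMemAlloc(&dData, nBlocks * blockSize * sizeof(float));

    // Legacy execution-control API: set the CTA shape and the kernel
    // parameters, then launch an nBlocks x 1 grid.
    cuFuncSetBlockShape(hKernel, blockSize, 1, 1);
    cuParamSetv(hKernel, 0, &dData, sizeof(dData));
    cuParamSetSize(hKernel, sizeof(dData));
    cuLaunchGrid(hKernel, nBlocks, 1);

    cuCtxSynchronize();
    cuMemFree(dData);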

37 SDK Sample 1: Simple
Flow: IR from file -> libnvvm (NVVM IR -> PTX) -> PTX JIT and execution on GPU.

38 SDK Sample 2: ptxgen
Read in a list of NVVM IR programs in LL or BC format from disk and perform IR level linking:

    for (i = 0; i < num_files; i++)
        nvvmCUAddModule(cu, source[i], size[i]);

Then generate PTX, or only verify the IR against the NVVM IR spec, by choosing the compile option:

    const char *gen_opts[]    = { "-target=nvptx"  };
    const char *verify_opts[] = { "-target=verify" };
    nvvmCompileCU(cu, 1, gen_opts);      // generate PTX
    nvvmCompileCU(cu, 1, verify_opts);   // verify only
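Assembling the fragments above into one place, here is a hedged sketch of a ptxgen-style driver. It is not the actual SDK sample: the read_file() helper, the verify_only flag, and the compilation-unit handle type name nvvmCU are assumptions made for this illustration, exact parameter types are as defined in the libnvvm API document, and error checking of the nvvm* return codes is omitted.

    #include <nvvm.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper: load a .ll/.bc file into memory. */
    extern char *read_file(const char *path, size_t *size);

    static void ptxgen(const char **files, int num_files, int verify_only)
    {
        const char *gen_opts[]    = { "-target=nvptx"  };
        const char *verify_opts[] = { "-target=verify" };
        nvvmCU cu;                    /* handle type name assumed; see nvvm.h */
        int i;

        nvvmInit();
        nvvmCreateCU(&cu);

        /* IR level linking: add every input module to one compilation unit. */
        for (i = 0; i < num_files; i++) {
            size_t size;
            char *source = read_file(files[i], &size);
            nvvmCUAddModule(cu, source, size);
        }

        if (verify_only) {
            nvvmCompileCU(cu, 1, verify_opts);        /* check against the spec */
        } else {
            size_t ptxSize;
            char *ptx;
            nvvmCompileCU(cu, 1, gen_opts);           /* generate PTX */
            nvvmGetCompiledResultSize(cu, &ptxSize);
            ptx = (char *)malloc(ptxSize);
            nvvmGetCompiledResult(cu, ptx);
            fwrite(ptx, 1, ptxSize, stdout);
            free(ptx);
        }

        nvvmDestroyCU(&cu);
        nvvmFini();
    }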

39 SDK Sample 2: ptxgen
Flow: IR from files -> libnvvm (IR level linking, IR verifier, NVVM IR -> PTX).

40 SDK Sample 3: Kaleidoscope
Based on the Kaleidoscope example in the LLVM tutorial: an interpreter for a simple language that builds expressions from primitive operators and user-defined functions, and can evaluate a function on a GPU over a sequence of arguments. Uses the LLVM IR builder. Statically linked with stock LLVM 3.0 (libLLVMCore.a and libLLVMBitWriter.a); dynamically linked with libnvvm.so and libcuda.so.

41 SDK Sample 3: Kaleidoscope

    ready> def foo(x)
    ready> 4*x;
    ready> eval foo ;
    ready> Evaluating foo on a gpu: Using CUDA Device [0]: GeForce GTX
    ready> def bar(a b)
    ready> foo(a)+b;
    ready> eval bar ;
    ready> Evaluating bar on a gpu: Using CUDA Device [0]: GeForce GTX

42 SDK Sample 3: Kaleidoscope
Flow: interpreter -> LLVM IR builder -> libnvvm (NVVM IR -> PTX) -> PTX JIT and execution on GPU.

43 SDK Sample 4: Glang
A prototype compiler based on Clang that compiles a subset of C++ programs to execute on GPUs. It has its own set of builtin functions.

    glang_kernel kernel_vector_add(global const float* a,
                                   global const float* b,
                                   global float* c,
                                   unsigned int size)
    {
        // Determine thread/block id/size
        int tidx  = glang_tid_x();
        int bidx  = glang_ctaid_x();
        int bsize = glang_ntid_x();

        // Compute offset into vector
        int index = bidx*bsize + tidx;

        // Compute addition for our element
        if (index < size) {
            c[index] = a[index] + b[index];
        }
    }

45 SDK Samples
Flow: clang (with user builtin functions) -> libnvvm (NVVM IR -> PTX, IR level linking) -> PTX JIT and execution on GPU.

46 SDK Sample 5: Rg
R is a language and environment for statistical computing and graphics. Rg dynamically compiles R code and executes it on a GPU, supporting a useful subset of R; it is an example of how to accelerate a DSL using libnvvm.

47 SDK Sample 5: Rg

    > v1 = c (1, 2, 3, 4)
    > dv1 = rg.dv (c (1, 2, 3, 4))
    > dv2 = rg.dv (c (10, 20, 30, 40))
    > dv3 = rg.gapply (function(x, y) {x+y;}, dv1, dv2)
    > as.double (dv3)
    [1] 11 22 33 44

48 SDK Sample 5: Rg

    # define the code for Mandelbrot
    mandelbrot <- function(x0, y0) {
        iteration <- 0L;
        max_iteration <- 50L;
        x <- 0;
        y <- 0;
        while ( (x*x + y*y < 4) && (iteration < max_iteration) ) {
            xtemp <- x*x - y*y + x0;
            y = 2*x*y + y0;
            x = xtemp;
            iteration = iteration + 1L;
        }
        color = iteration;
        color;
    }

    # create data
    dv_points_x = rg.dv(points_x);
    dv_points_y = rg.dv(points_y);

    # compile and run and get results!
    dv_points_color = rg.gapply(mandelbrot, dv_points_x, dv_points_y);
    colorvec = as.integer(dv_points_color);

49 Demo

50 SDK Sample 5: Rg
Flow: R -> LLVM IR builder -> libnvvm (NVVM IR -> PTX) -> PTX JIT and execution on GPU.

51 SDK Samples
Summary flow: interpreter, R, clang, or IR from file -> LLVM IR builder / user builtin functions -> libnvvm (NVVM IR -> PTX, IR level linking, IR verifier) -> PTX JIT and execution on GPU.

52 How to get it?
Open to all NVIDIA registered developers. Available with the CUDA 5.0 preview release. File bugs and post questions to the NVIDIA forum, tagged NVVM.


54 Backup slides

55 SDK Sample 3: Kaleidoscope ks Interpreter

56 SDK Sample 4: Glang
Build components: a.glang, a.glang.cpp, a.glang.hpp, host.cpp, a.out, libglangrt.so, libcuda.so.

57 SDK Sample 4: Glang
glang compiles a.glang into a.glang.cpp and a.glang.hpp; these are built together with host.cpp into a.out, which links against libglangrt.so and libcuda.so.

58 SDK Sample 5: Rg (Overview)
Create R objects as proxies for GPU vector data; the Rg runtime handles data creation and transfers to and from the GPU. Write scalar R functions and use rg.gapply to map the scalar function over the vector data. Rg dynamically creates code for the scalar function, specialized to the runtime type of the respective vector elements, launches the generated code on the GPU, and creates an R object which is a proxy for the result.

59 SDK Sample 5: Rg (Compiler)
rg.gapply(f, v1, v2, ..., vn) compiles f and launches the generated code on vector arguments v1, ..., vn already resident on the GPU. v1, ..., vn are atomic vectors of element types t1, ..., tn discovered at runtime. Rg creates a type-specialized LLVM IR for f with signature f: t1 x ... x tn -> T, where T is the inferred type of the body of f.
