Automatic translation from CUDA to C++
Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo
31 August, 2015

Goals

Running CUDA code on CPUs. Why? Performance portability! This is a major challenge faced today by developers on heterogeneous high-performance computers: we want to write the best possible code using the best possible frameworks and run it on the best possible hardware. Our code should run well even after translation, on a platform that does not need to have an NVIDIA graphics card. Is CUDA an effective data-parallel programming model for more than just GPU architectures?

CUDA Computational Model

Data-parallel model of computation: data are split into a 1D, 2D, or 3D grid of blocks, and each block can itself be 1D, 2D, or 3D in shape, with up to 1024 threads per block (512 on older GPUs). Built-in variables:

- dim3 threadIdx
- dim3 blockIdx
- dim3 blockDim
- dim3 gridDim

[Figure: a device holding Grid 1, a 3x2 arrangement of blocks labeled Block (0,0,0) through Block (2,1,0)]
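For orientation, here is a minimal CUDA sketch (ours, not from the slides) showing how a launch configuration and the built-in variables fit together; the kernel name scale and its parameters are illustrative only:

    #include <cstdio>

    __global__ void scale(float* data, float factor) {
        // Each thread computes its global index from the built-in variables.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= factor;
    }

    int main() {
        const int n = 4 * 256;
        float* data;
        cudaMalloc(&data, n * sizeof(float));

        // A 1D grid of 4 blocks, each with 256 threads; dim3 also takes
        // two or three arguments for 2D/3D shapes.
        dim3 numBlocks(4);
        dim3 threadsPerBlock(256);
        scale<<<numBlocks, threadsPerBlock>>>(data, 2.0f);

        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }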

Mapping

The mapping of the computation onto the CPU is NOT straightforward. The conceptually easiest implementation spawns a CPU thread for every GPU thread specified in the programming model. This is quite inefficient: it loses the locality benefits expressed in the model and incurs a large amount of scheduling overhead.

Mapping

Instead, we translate the CUDA program so that the mapping of programming constructs maintains the locality expressed in the programming model, using existing operating-system and hardware features. CUDA block execution is asynchronous, each CPU thread should be scheduled to a single core for locality, and the ordering semantics imposed by potential barrier synchronization points must be maintained. The mapping is summarized below:

    CUDA construct                    C++ construct
    block (asynchronous)              std::thread / task (asynchronous)
    thread (synchronous, barriers)    sequential unrolled for loop (can be vectorized)
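A minimal sketch of this mapping (ours, not from the slides; the names launch and kernelBody are illustrative): each former CUDA block becomes a std::thread, and the block's CUDA threads become iterations of a sequential loop inside it.

    #include <thread>
    #include <vector>

    // Body of a former CUDA kernel: built-ins are now explicit parameters.
    void kernelBody(int blockIdx, int threadIdx, int blockDim, float* data) {
        int i = blockIdx * blockDim + threadIdx;
        data[i] *= 2.0f;
    }

    // Translated launch: blocks run asynchronously, threads sequentially.
    void launch(int numBlocks, int blockDim, float* data) {
        std::vector<std::thread> workers;
        for (int b = 0; b < numBlocks; ++b)
            workers.emplace_back([=] {
                for (int t = 0; t < blockDim; ++t)  // thread loop (vectorizable)
                    kernelBody(b, t, blockDim, data);
            });
        for (auto& w : workers) w.join();
    }

    int main() {
        std::vector<float> data(4 * 256, 1.0f);
        launch(4, 256, data.data());
        return 0;
    }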

Source-to-source translation

Why don't we just find a way to compile CUDA for x86? Because having the output source code would be nice! We would like to analyze the obtained code, both for further optimizations and for debugging, and we don't want to focus only on x86.

Working with ASTs

What is an Abstract Syntax Tree? A tree representation of the abstract syntactic structure of source code written in a programming language. For example, the snippet

    while (x < 10) { x := x + 1; }

corresponds to the tree

    while
    ├── <
    │   ├── x
    │   └── 10
    └── :=
        ├── x
        └── +
            ├── x
            └── 1

Why aren't regular-expression tools powerful enough? Almost all programming languages are (deterministic) context-free languages, a superset of regular languages.

Clang

Clang is a compiler front end for the C, C++, Objective-C and Objective-C++ programming languages. It uses LLVM as its back end. Clang's AST closely resembles both the written C++ code and the C++ standard, which makes it a good fit for refactoring tools. Crucially for us, Clang handles CUDA syntax!
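As an aside (our usage note, not from the slides), Clang can print its AST directly, which is handy when developing a translator. Depending on the Clang version, parsing a .cu file may additionally require a CUDA installation or a flag such as --cuda-host-only:

    clang++ -fsyntax-only -Xclang -ast-dump vector_add.cu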

Vector addition: host translation

    int main() {
        ...
        sum<<<numBlocks, numThreadsPerBlock>>>(in_a, in_b, out_c);
        ...
    }

becomes

    int main() {
        ...
        for (i = 0; i < numBlocks; i++)
            sum(in_a, in_b, out_c, i);
        ...
    }

Vector addition: kernel translation

    __global__ void sum(int* a, int* b, int* c) {
        ...
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
        ...
    }

becomes

    void sum(int* a, int* b, int* c, dim3 blockIdx, dim3 blockDim) {
        dim3 threadIdx;
        ...
        // Thread loop
        for (threadIdx.x = 0; threadIdx.x < blockDim.x; threadIdx.x++) {
            c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
        }
        ...
    }

threadIdx, blockIdx and blockDim are not built-in variables anymore! The loop enumerates the values of the previously implicit threadIdx.
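Putting the two halves together, the following self-contained sketch (ours; the slides show only fragments) is a complete translated program that compiles with any C++ compiler. The dim3 stand-in is ours, and unlike the slides' fragment we index by the global thread id so that several blocks cooperate on one vector:

    #include <cstdio>

    // Minimal stand-in for CUDA's dim3 on the host (illustrative only).
    struct dim3 {
        unsigned x = 1, y = 1, z = 1;
    };

    // Translated kernel: built-ins become parameters plus a thread loop.
    void sum(const int* a, const int* b, int* c,
             dim3 blockIdx, dim3 blockDim) {
        dim3 threadIdx;
        for (threadIdx.x = 0; threadIdx.x < blockDim.x; threadIdx.x++) {
            unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
            c[i] = a[i] + b[i];
        }
    }

    int main() {
        const unsigned numBlocks = 4, threadsPerBlock = 8;
        const unsigned n = numBlocks * threadsPerBlock;
        int a[n], b[n], c[n];
        for (unsigned i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        // Translated launch: a loop over blocks replaces <<<...>>>.
        for (unsigned bl = 0; bl < numBlocks; bl++) {
            dim3 blockIdx; blockIdx.x = bl;
            dim3 blockDim; blockDim.x = threadsPerBlock;
            sum(a, b, c, blockIdx, blockDim);
        }
        printf("c[5] = %d (expected 15)\n", c[5]);
        return 0;
    }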

Example: Stencil

Consider applying a 1D stencil to a 1D array of elements: each output element is the sum of the input elements within a radius. This pattern is fundamental to many algorithms: standard discretization methods, interpolation, convolution, filtering. If the radius is 3, then each output element is the sum of 7 input elements (the element itself plus a halo of radius elements on each side).
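As a point of reference (our sketch, not from the slides), the sequential computation is just a pair of nested loops; the caller must provide RADIUS halo elements on each side of the input:

    #include <cstdio>

    const int RADIUS = 3;

    // Sequential 1D stencil: out[i] = sum of in[i-RADIUS .. i+RADIUS].
    // `in` points at the first "real" element, with RADIUS halo
    // elements available on each side.
    void stencil_1d_seq(const int* in, int* out, int n) {
        for (int i = 0; i < n; i++) {
            int result = 0;
            for (int offset = -RADIUS; offset <= RADIUS; offset++)
                result += in[i + offset];
            out[i] = result;
        }
    }

    int main() {
        int buf[10 + 2 * RADIUS] = {};               // zero-filled halo
        for (int i = 0; i < 10; i++) buf[RADIUS + i] = 1;
        int out[10];
        stencil_1d_seq(buf + RADIUS, out, 10);
        printf("out[5] = %d (expected 7)\n", out[5]);  // 7 ones in window
        return 0;
    }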

Sharing Data Between Threads

Terminology: within a block, threads share data via shared memory. It is declared using the __shared__ qualifier and allocated per block; the data is not visible to threads in other blocks.

Implementing with shared memory

Cache data in shared memory:
- Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
- Compute blockDim.x output elements
- Write blockDim.x output elements to global memory

Each block needs a halo of radius elements at each boundary: a halo on the left and a halo on the right of the blockDim.x output elements.

Stencil kernel

    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int global_id = threadIdx.x + blockIdx.x * blockDim.x;
        int local_id = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[local_id] = in[global_id];
        if (threadIdx.x < RADIUS) {
            temp[local_id - RADIUS] = in[global_id - RADIUS];
            temp[local_id + BLOCK_SIZE] = in[global_id + BLOCK_SIZE];
        }

        // Synchronize (ensure all the data is available)
        __syncthreads();

Stencil kernel (continued)

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[local_id + offset];

        // Store the result
        out[global_id] = result;
    }

Stencil: CUDA kernel

    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int global_id = threadIdx.x + blockIdx.x * blockDim.x;
        int local_id = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[local_id] = in[global_id];
        if (threadIdx.x < RADIUS) {
            temp[local_id - RADIUS] = in[global_id - RADIUS];
            temp[local_id + BLOCK_SIZE] = in[global_id + BLOCK_SIZE];
        }

        // Synchronize (ensure all the data is available)
        __syncthreads();

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[local_id + offset];

        // Store the result
        out[global_id] = result;
    }

Stencil: kernel translation (1)

    void stencil_1d(int *in, int *out, dim3 blockDim, dim3 blockIdx) {
        int temp[BLOCK_SIZE + 2 * RADIUS];
        int global_id;
        int local_id;

        THREAD_LOOP_BEGIN {
            global_id = threadIdx.x + blockIdx.x * blockDim.x;
            local_id = threadIdx.x + RADIUS;

            // Read input elements into shared memory
            temp[local_id] = in[global_id];
            if (threadIdx.x < RADIUS) {
                temp[local_id - RADIUS] = in[global_id - RADIUS];
                temp[local_id + BLOCK_SIZE] = in[global_id + BLOCK_SIZE];
            }
        } THREAD_LOOP_END

        THREAD_LOOP_BEGIN {
            // Apply the stencil
            int result = 0;
            for (int offset = -RADIUS; offset <= RADIUS; offset++)
                result += temp[local_id + offset];

            // Store the result
            out[global_id] = result;
        } THREAD_LOOP_END
    }

A thread loop implicitly introduces a barrier synchronization among logical threads at its boundaries: the __syncthreads() call has become the split between the two thread loops.
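The slides do not show how THREAD_LOOP_BEGIN / THREAD_LOOP_END are defined. One plausible expansion (our assumption), for 1D blocks and assuming the translated function declares a local dim3 threadIdx as in the vector-addition translation, is simply:

    // Hypothetical expansion of the thread-loop markers (our assumption).
    // Each loop enumerates the logical CUDA threads of the block.
    #define THREAD_LOOP_BEGIN \
        for (threadIdx.x = 0; threadIdx.x < blockDim.x; ++threadIdx.x) {
    #define THREAD_LOOP_END }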

Stencil: kernel translation (2)

    void stencil_1d(int *in, int *out, dim3 blockDim, dim3 blockIdx) {
        int temp[BLOCK_SIZE + 2 * RADIUS];
        int global_id[];
        int local_id[];

        THREAD_LOOP_BEGIN {
            global_id[tid] = threadIdx.x + blockIdx.x * blockDim.x;
            local_id[tid] = threadIdx.x + RADIUS;

            // Read input elements into shared memory
            temp[local_id[tid]] = in[global_id[tid]];
            if (threadIdx.x < RADIUS) {
                temp[local_id[tid] - RADIUS] = in[global_id[tid] - RADIUS];
                temp[local_id[tid] + BLOCK_SIZE] = in[global_id[tid] + BLOCK_SIZE];
            }
        } THREAD_LOOP_END

        THREAD_LOOP_BEGIN {
            // Apply the stencil
            int result = 0;
            for (int offset = -RADIUS; offset <= RADIUS; offset++)
                result += temp[local_id[tid] + offset];

            // Store the result
            out[global_id[tid]] = result;
        } THREAD_LOOP_END
    }

We create an array of values for each local variable of the former CUDA threads whose value must survive across a barrier (global_id and local_id live across the two thread loops, while result does not). Statements within thread loops access these arrays by the loop index tid.
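For completeness, here is a self-contained, compilable version of the translated stencil (our sketch; the slides leave the array sizes, the dim3 stand-in, and the driver code implicit). The per-thread arrays are sized by the block size, and the driver provides halo elements around the input:

    #include <cstdio>

    const int RADIUS = 3;
    const int BLOCK_SIZE = 8;

    struct dim3 { unsigned x = 1, y = 1, z = 1; };

    // Translated stencil kernel: one call processes one former CUDA block.
    void stencil_1d(const int* in, int* out, dim3 blockDim, dim3 blockIdx) {
        int temp[BLOCK_SIZE + 2 * RADIUS];        // former __shared__ array
        int global_id[BLOCK_SIZE];                // per-thread values that
        int local_id[BLOCK_SIZE];                 // live across the barrier
        dim3 threadIdx;

        // First thread loop: load shared memory (up to the former barrier).
        for (threadIdx.x = 0; threadIdx.x < blockDim.x; ++threadIdx.x) {
            unsigned tid = threadIdx.x;
            global_id[tid] = threadIdx.x + blockIdx.x * blockDim.x;
            local_id[tid] = threadIdx.x + RADIUS;
            temp[local_id[tid]] = in[global_id[tid]];
            if (threadIdx.x < RADIUS) {
                temp[local_id[tid] - RADIUS] = in[global_id[tid] - RADIUS];
                temp[local_id[tid] + BLOCK_SIZE] = in[global_id[tid] + BLOCK_SIZE];
            }
        }

        // Second thread loop: apply the stencil (after the former barrier).
        for (threadIdx.x = 0; threadIdx.x < blockDim.x; ++threadIdx.x) {
            unsigned tid = threadIdx.x;
            int result = 0;                       // never crosses a barrier,
            for (int offset = -RADIUS; offset <= RADIUS; offset++)  // so it stays scalar
                result += temp[local_id[tid] + offset];
            out[global_id[tid]] = result;
        }
    }

    int main() {
        const int numBlocks = 2, n = numBlocks * BLOCK_SIZE;
        int buf[n + 2 * RADIUS] = {};             // zero-filled halo
        for (int i = 0; i < n; i++) buf[RADIUS + i] = 1;
        int out[n];

        dim3 blockDim; blockDim.x = BLOCK_SIZE;
        for (int b = 0; b < numBlocks; b++) {     // translated launch loop
            dim3 blockIdx; blockIdx.x = b;
            stencil_1d(buf + RADIUS, out, blockDim, blockIdx);
        }
        printf("out[8] = %d (expected 7)\n", out[8]);
        return 0;
    }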

Benchmark

    Used source code      Time (ms)   Slowdown wrt CUDA
    CUDA¹                 3.41406     1
    Translated TBB²       9.41103     2.76
    Native sequential³    22.451      6.58
    Native TBB²           14.129      4.14

1 - Not optimized CUDA code, memory copies included
2 - We wrapped the call to the function in a TBB parallel_for
3 - Not optimized, naive implementation

Hardware: Intel Core i7-4771 CPU @ 3.50 GHz; NVIDIA Tesla K40 (Kepler GK110B)
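The TBB wrapping mentioned in note 2 presumably looks like the following minimal sketch (ours; stencil_1d, dim3 and BLOCK_SIZE are the names from the translated kernel above, the launch_stencil_tbb wrapper is our invention). One TBB task runs each former CUDA block:

    #include <tbb/parallel_for.h>

    // Run the translated blocks in parallel: one TBB task per former block.
    void launch_stencil_tbb(const int* in, int* out, int numBlocks) {
        dim3 blockDim; blockDim.x = BLOCK_SIZE;
        tbb::parallel_for(0, numBlocks, [=](int b) {
            dim3 blockIdx; blockIdx.x = (unsigned)b;
            stencil_1d(in, out, blockDim, blockIdx);
        });
    }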

Conclusion

One of the most challenging aspects was obtaining the ASTs from CUDA source code; this was solved by including the CUDA headers at compile time, and now we can match all CUDA syntax. At the moment the kernel translation is almost complete. Interesting results, but still a good amount of work to do.

Future work

- Handle the CUDA API in the host code (e.g. cudaMallocManaged)
- Atomic operations in the kernel
- Adding vectorization pragmas
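The deck does not specify how the host API would be translated; one plausible sketch (entirely our assumption) is to let cudaMallocManaged degenerate to an ordinary host allocation, since host and "device" memory coincide on the CPU. The real CUDA signature also takes an optional flags argument, omitted here:

    #include <cstdlib>

    // Hypothetical CPU-side stand-in for cudaMallocManaged (our assumption):
    // with no separate device memory, managed memory is plain host memory.
    typedef int cudaError_t;          // stand-in for the real enum
    const cudaError_t cudaSuccess = 0;

    cudaError_t cudaMallocManaged(void** ptr, size_t size) {
        *ptr = std::malloc(size);
        return (*ptr != nullptr) ? cudaSuccess : 1;  // non-zero on failure
    }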

Thank you

GitHub: https://github.com/hpc4hep/cudatocpp

References:
Felice Pantaleo, Introduction to GPU programming using CUDA

(Backup) Example: __syncthreads() inside a while statement

    while (j > 0) {
        foo();
        __syncthreads();
        bar();
    }

becomes

    bool newcond[];
    bool go;
    thread_loop {
        newcond[tid] = (j[tid] > 0);
    }
    LABEL:
    go = false;
    thread_loop {
        if (newcond[tid]) {
            foo();
        }
    }
    thread_loop {
        if (newcond[tid]) {
            bar();
        }
    }
    thread_loop {
        newcond[tid] = (j[tid] > 0);
        if (newcond[tid]) go = true;
    }
    if (go) goto LABEL;

Each barrier inside the loop body splits the body into separate thread loops, and the loop condition is re-evaluated collectively before jumping back.