Advanced Topics in CUDA C


S. Sundar and M. Panchatcharam (IIT Madras), August 9, 2014

Outline

1. Julia Set
2. Julia GPU
3. Compilation
4. Bitonic Sort (code)

Julia Set

The following example demonstrates code that draws slices of the Julia set. The Julia set is the boundary of a certain class of functions over complex numbers, and a fun example of a fractal. Evaluating the set amounts to running a simple iterative equation for each point in the complex plane. A point is not in the set if iterating the equation diverges for that point: if the sequence of values produced by the iteration grows toward infinity, the point is considered outside the set; if the values remain bounded, the point is in the set.

Julia set

Z_{n+1} = Z_n^2 + C    (1)

Each step squares the current value and adds a constant to obtain the next value. Since this is a more complicated program, we will split it into pieces. Note: the code uses some header files (book.h, cpu_bitmap.h) that are not discussed in detail here.

Julia main function

int main( void ) {
    CPUBitmap bitmap( DIM, DIM );
    unsigned char *ptr = bitmap.get_ptr();

    kernel( ptr );
    bitmap.display_and_exit();
}

The main function is very simple.

Julia kernel function (CPU version)

void kernel( unsigned char *ptr ) {
    for (int y = 0; y < DIM; y++) {
        for (int x = 0; x < DIM; x++) {
            int offset = x + y * DIM;

            int juliaValue = julia( x, y );
            ptr[offset*4 + 0] = 255 * juliaValue;
            ptr[offset*4 + 1] = 0;
            ptr[offset*4 + 2] = 0;
            ptr[offset*4 + 3] = 255;
        }
    }
}

Julia function (CPU version)

int julia( int x, int y ) {
    const float scale = 1.5;
    float jx = scale * (float)(DIM/2 - x) / (DIM/2);
    float jy = scale * (float)(DIM/2 - y) / (DIM/2);

    cuComplex c( -0.8, 0.156 );
    cuComplex a( jx, jy );

    for (int i = 0; i < 200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000)
            return 0;
    }
    return 1;
}

cuComplex structure (CPU version)

#include "book.h"
#include "cpu_bitmap.h"

#define DIM 1000

struct cuComplex {
    float r;
    float i;
    cuComplex( float a, float b ) : r(a), i(b) {}
    float magnitude2( void ) { return r * r + i * i; }
    cuComplex operator*( const cuComplex &a ) {
        return cuComplex( r*a.r - i*a.i, i*a.r + r*a.i );
    }
    cuComplex operator+( const cuComplex &a ) {
        return cuComplex( r + a.r, i + a.i );
    }
};

The kernel computation iterates through all points we care to render, calling julia() for each one. julia() returns 1 if the point is in the set and 0 if it is not. The pixel is colored red if julia() returns 1 and black if it returns 0. These two colors are arbitrary; you can use any colors you like.

julia() is the main part of this example. It translates a pixel coordinate to a coordinate in complex space. To center the complex plane at the image center, we shift by DIM/2. To ensure that the image spans the range -1.0 to 1.0, we scale the image coordinate by DIM/2. Given an image point at (x, y), the corresponding point in complex space is ( (DIM/2 - x)/(DIM/2), (DIM/2 - y)/(DIM/2) ).

To zoom in or out, we introduce a scale factor of 1.5; tweak this parameter to zoom in or out. After obtaining the point in complex space, we need to determine whether the point is inside or outside the Julia set. Remember Z_{n+1} = Z_n^2 + C. Since C is an arbitrary complex-valued constant, we have chosen -0.8 + 0.156i because it happens to yield an interesting picture. Play with this constant if you want to see other versions of the Julia set.

Memory

Since all the computations are performed on complex numbers, we define a generic structure to store complex values. The structure represents complex numbers with two data elements: a single-precision real component r and a single-precision imaginary component i. It defines addition and multiplication operators that combine complex numbers as expected.

Structure

The complex structure, changed for the GPU:

#include "book.h"
#include "cpu_bitmap.h"

#define DIM 1000

struct cuComplex {
    float r;
    float i;
    __host__ __device__ cuComplex( float a, float b ) : r(a), i(b) {}
    __device__ float magnitude2( void ) { return r * r + i * i; }
    __device__ cuComplex operator*( const cuComplex &a ) {
        return cuComplex( r*a.r - i*a.i, i*a.r + r*a.i );
    }
    __device__ cuComplex operator+( const cuComplex &a ) {
        return cuComplex( r + a.r, i + a.i );
    }
};

Julia

The julia function becomes a __device__ function:

__device__ int julia( int x, int y ) {
    const float scale = 1.5;
    float jx = scale * (float)(DIM/2 - x) / (DIM/2);
    float jy = scale * (float)(DIM/2 - y) / (DIM/2);

    cuComplex c( -0.8, 0.156 );
    cuComplex a( jx, jy );

    for (int i = 0; i < 200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000)
            return 0;
    }
    return 1;
}

Julia Kernel

The same computation, now as a __global__ kernel with one block per pixel:

__global__ void kernel( unsigned char *ptr ) {
    // map from blockIdx to pixel position
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;

    // now calculate the value at that position
    int juliaValue = julia( x, y );
    ptr[offset*4 + 0] = 255 * juliaValue;
    ptr[offset*4 + 1] = 0;
    ptr[offset*4 + 2] = 0;
    ptr[offset*4 + 3] = 255;
}

Julia Main

// globals needed by the update routine
struct DataBlock {
    unsigned char *dev_bitmap;
};

int main( void ) {
    DataBlock data;
    CPUBitmap bitmap( DIM, DIM, &data );
    unsigned char *dev_bitmap;

    HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap, bitmap.image_size() ) );
    data.dev_bitmap = dev_bitmap;

    dim3 grid( DIM, DIM );
    kernel<<<grid,1>>>( dev_bitmap );

    HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
                              bitmap.image_size(),
                              cudaMemcpyDeviceToHost ) );
    HANDLE_ERROR( cudaFree( dev_bitmap ) );

    bitmap.display_and_exit();
}

We then run our kernel() function exactly as in the CPU version, although now it is a __global__ function that runs on the GPU. As in the CPU version, we pass kernel() the pointer we allocated in the previous line to store the results. The only difference is that the memory now resides on the GPU, not on the host system.

Compilation

We specify a two-dimensional grid of blocks in this line: dim3 grid(DIM,DIM); The type dim3 is not a standard C type; it represents a three-dimensional tuple used to specify the size of our launch (any unspecified dimension defaults to 1).

Compilation

To build this example we need the GL, GLU and GLUT development files, which are available in Ubuntu. To get them, install freeglut3-dev; in a terminal, type: sudo apt-get install freeglut3-dev

For compilation: nvcc juliagpu.cu -lglut -lGL -lGLU

Specifying the architecture while compiling CUDA

Sometimes we need to specify the target architecture. For example, if your code contains the atomicAdd() function, then your SM architecture must be newer than sm_10; otherwise you get an "atomicAdd: undefined identifier" error. In this case, specify: nvcc -arch=sm_30

Or: nvcc -gencode arch=compute_30,code=sm_30

Bitonic Sort

(Slides 22 to 29 illustrated the bitonic sorting network and a summary with diagrams; the figures are not reproduced in this transcription.)

Code

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

/* Every thread gets exactly one value in the unsorted array. */
#define THREADS 512              // 2^9
#define BLOCKS 32768             // 2^15
#define NUM_VALS (THREADS * BLOCKS)

void print_elapsed( clock_t start, clock_t stop ) {
    double elapsed = ((double)(stop - start)) / CLOCKS_PER_SEC;
    printf( "Elapsed time: %.3fs\n", elapsed );
}

Code

float random_float() {
    return (float)rand() / (float)RAND_MAX;
}

void array_print( float *arr, int length ) {
    int i;
    for (i = 0; i < length; ++i) {
        printf( "%1.3f ", arr[i] );
    }
    printf( "\n" );
}

Code

void array_fill( float *arr, int length ) {
    srand( time( NULL ) );
    int i;
    for (i = 0; i < length; ++i) {
        arr[i] = random_float();
    }
}

__global__ void bitonic_sort_step( float *dev_values, int j, int k ) {
    unsigned int i, ixj;  /* Sorting partners: i and ixj */
    i = threadIdx.x + blockDim.x * blockIdx.x;
    ixj = i ^ j;

Code

    /* The threads with the lowest ids sort the array. */
    if (ixj > i) {
        if ((i & k) == 0) {
            /* Sort ascending */
            if (dev_values[i] > dev_values[ixj]) {
                /* exchange(i, ixj); */
                float temp = dev_values[i];
                dev_values[i] = dev_values[ixj];
                dev_values[ixj] = temp;
            }
        }
        if ((i & k) != 0) {
            /* Sort descending */
            if (dev_values[i] < dev_values[ixj]) {
                /* exchange(i, ixj); */
                float temp = dev_values[i];
                dev_values[i] = dev_values[ixj];
                dev_values[ixj] = temp;
            }
        }
    }
}

Code

/* In-place bitonic sort using CUDA. */
void bitonic_sort( float *values ) {
    float *dev_values;
    size_t size = NUM_VALS * sizeof( float );

    cudaMalloc( (void**)&dev_values, size );
    cudaMemcpy( dev_values, values, size, cudaMemcpyHostToDevice );

    dim3 blocks( BLOCKS, 1 );    /* Number of blocks */
    dim3 threads( THREADS, 1 );  /* Number of threads per block */

    int j, k;
    /* Major step */
    for (k = 2; k <= NUM_VALS; k <<= 1) {
        /* Minor step */
        for (j = k >> 1; j > 0; j = j >> 1) {
            bitonic_sort_step<<<blocks, threads>>>( dev_values, j, k );
        }
    }

Code

    cudaMemcpy( values, dev_values, size, cudaMemcpyDeviceToHost );
    cudaFree( dev_values );
}

int main( void ) {
    clock_t start, stop;

    float *values = (float*)malloc( NUM_VALS * sizeof( float ) );
    array_fill( values, NUM_VALS );

    start = clock();
    bitonic_sort( values );  /* In-place */
    stop = clock();

    print_elapsed( start, stop );
    free( values );
}

THANK YOU