Introduction to Parallel Programming


Introduction to Parallel Programming. Pablo Brubeck, Department of Physics, Tecnológico de Monterrey. October 14, 2016. Tecnológico de Monterrey Student Chapter.

Outline: The basics (data types, pointers); parallel programming (hardware, parallel primitives); the CUDA programming model (thread hierarchy, memory hierarchy, a simple code example).

Hello World! This code simply prints "Hello World!".

#include <stdlib.h>
#include <stdio.h>

int main() {
    printf("Hello World!\n");
    return 0;
}

C++ has explicit data types.

bool isempty = true;
int particles = 1024;
float Tf = 273.15;
double phi = 1.61803398875;

Structs are complex data types. We can define the 3D vector data type:

struct float3 { float x, y, z; };

float3 pos;
pos.x = 3.2; pos.y = -4.7; pos.z = 1.34;

A pointer is a variable that stores the memory address of another variable. When a variable is declared, the memory needed to store its value is assigned to a specific location. The program does not explicitly decide the exact memory location. A pointer can also be used to access data cells that sit at a certain position relative to the address it holds.

Pointers may be obtained from values and vice versa. & is the address-of operator, and can be read simply as "address of". * is the dereference operator, and can be read as "value pointed to by".

float val;    // val is declared as a variable
float *ptr;   // ptr is declared as a pointer
val = 0.577;
ptr = &val;   // ptr must point somewhere before it is dereferenced
*ptr = 0.577;
printf("Address of val %p\n", (void *)&val);
printf("Address of *ptr %p\n", (void *)ptr);

An array may be declared with a pointer.

const double pi = 3.141592653589793;
int N = 1024;
double *foo = new double[N];
for (int i = 0; i < N; i++) {
    foo[i] = cos(pi/N*i);
}

The pointer foo refers to the memory address of the first element foo[0]. The pointer foo+k, where k is an integer, refers to the memory address of the k-th element foo[k].
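As a quick illustration (a snippet of my own, assuming the foo array above), index notation and pointer arithmetic are interchangeable:

double a = foo[3];        // index notation
double b = *(foo + 3);    // pointer arithmetic: same element
printf("%f %f\n", a, b);  // prints the same value twice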

Matrices are two-dimensional arrays. A matrix of size m × n may be indexed in column-major storage, where element (i, j) lives at index j*m + i:

int m = 128, n = 256;
double *A = new double[m*n];
for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
        A[j*m + i] = cos(pi/m*i*j);
    }
}

Outline: The basics (data types, pointers); parallel programming (hardware, parallel primitives); the CUDA programming model (thread hierarchy, memory hierarchy, a simple code example).

Why parallelism? "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" (Seymour Cray, the father of supercomputing)

Parallelization is desired to increase throughput: Throughput = number of computations / amount of time. Throughput is measured in floating-point operations per second (FLOPS). For example, a hypothetical GPU with 2048 cores, each completing one fused multiply-add (2 floating-point operations) per cycle at 1 GHz, would deliver 2048 × 2 × 10^9 ≈ 4 × 10^12 FLOPS, i.e. about 4 TFLOPS.

The CPU is excellent for serial tasks. The CPU is in charge of coordinating hardware and software. Typically, it may have from 2 to 12 processing cores. Many programs run entirely on the CPU.

The GPU is excellent for parallel tasks. The GPU is in charge of processing and rendering graphics onto your display. Typically, it may have from hundreds to thousands of processing cores. General-purpose GPU (GPGPU) programming takes advantage of this enormous computing power.

The GPU devotes more transistors to data processing. GPUs have many more arithmetic logic units (ALUs).

Parallel primitives are common, highly parallelizable operations: mapping, stencil, reduction, and scan (prefix sum).

Mapping. A mapping is a transformation of N inputs into N outputs where the same unary function is applied independently to every array element:

y_i = f(x_i), i = 0, 1, ..., N-1.

Examples: compute the square of each array element; compute the color of each pixel of an image; transform each vertex of a 3D model.

Stencil. A stencil operation is a transformation of N inputs into N outputs where the operator takes some of the neighboring elements:

y_i = f(..., x_{i-1}, x_i, x_{i+1}, ...), i = 0, 1, ..., N-1.

Examples: finite differences; image filtering; convolutional neural networks.
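As an illustration (a sketch of my own, not from the original slides), a CUDA kernel for a 3-point finite-difference stencil could look like this; skipping the two endpoints is one simple boundary choice among many:

// Centered first derivative: y[i] = (x[i+1] - x[i-1]) / (2h)
__global__ void diff3(int N, const float *x, float *y, float h) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i > 0 && i < N - 1) {       // interior points only
        y[i] = (x[i+1] - x[i-1]) / (2.0f*h);
    }
}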

Reduction. A reduction operation takes N inputs and produces a single output with an associative binary operator ⊕:

y = x_0 ⊕ x_1 ⊕ ... ⊕ x_{N-1}.

Examples: find the total sum of an array; find the maximum element of an array; find the minimum element of an array.
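A classic implementation (a sketch of my own, assuming the block size is 512, a power of two, and N a multiple of it) reduces each block in shared memory with a tree pattern and writes one partial sum per block:

__global__ void blocksum(const float *x, float *partial) {
    __shared__ float cache[512];                   // assumes blockDim.x == 512
    int tid = threadIdx.x;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    cache[tid] = x[i];
    __syncthreads();
    // Tree reduction: halve the number of active threads each step
    for (int s = blockDim.x/2; s > 0; s /= 2) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = cache[0];  // one partial sum per block
}

The per-block partial sums can then be reduced again, either on the host or with a second kernel launch.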

Scan or prefix sum. A scan operation takes N inputs and produces N partial outputs by applying an associative binary operator ⊕:

y_i = x_0 ⊕ x_1 ⊕ ... ⊕ x_i, i = 0, 1, ..., N-1.

Examples: find the accumulated sum of an array; compute the histogram of a series of data; image processing, such as High Dynamic Range (HDR) imaging.
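In serial form the semantics are just a running total (an inclusive scan with + as the operator); efficient parallel formulations exist but are less obvious:

y[0] = x[0];
for (int i = 1; i < N; i++) {
    y[i] = y[i-1] + x[i];   // y[i] holds the sum of x[0..i]
}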

Parallel primitives are already implemented in libraries. Thrust is a CUDA library that allows you to implement high-performance parallel applications with minimal programming effort.
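For instance, a sum and a prefix sum each take a single call (a sketch using Thrust's reduce and inclusive_scan, compiled with nvcc):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <cstdio>

int main() {
    thrust::device_vector<float> x(1024, 1.0f);            // 1024 ones on the device
    float total = thrust::reduce(x.begin(), x.end());      // reduction: 1024
    thrust::device_vector<float> y(1024);
    thrust::inclusive_scan(x.begin(), x.end(), y.begin()); // y[i] = i + 1
    printf("%f %f\n", total, (float)y[1023]);
    return 0;
}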

Outline: The basics (data types, pointers); parallel programming (hardware, parallel primitives); the CUDA programming model (thread hierarchy, memory hierarchy, a simple code example).

Parallel code execution is organized into a grid of thread blocks.

Many threads execute the same piece of serial code simultaneously. These pieces of serial code are known as kernels. Their execution is grouped into thread blocks, and thread blocks are grouped into a grid. Threads and blocks may have up to three-dimensional indices.
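For two-dimensional data, a kernel can combine the x and y indices (a sketch of my own, using the column-major layout from the matrix example earlier):

// Scale every entry of an m-by-n matrix stored column-major
__global__ void scale2d(int m, int n, float *A, float s) {
    int i = blockIdx.y*blockDim.y + threadIdx.y;   // row index
    int j = blockIdx.x*blockDim.x + threadIdx.x;   // column index
    if (i < m && j < n) A[j*m + i] *= s;
}

Launched, for example, with dim3 block(16, 16); dim3 grid((n + 15)/16, (m + 15)/16);.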

Parallelism is limited by the GPU. Since the whole task might require more parallel processors than the ones that are available, the task is divided into thread blocks. All threads within a thread block run on the same multiprocessor and can be synchronized with each other (e.g. with __syncthreads()); there is no such guarantee across blocks.

Memory is shared at different levels. Per-thread local memory (held in registers) is the fastest, but it is limited to an individual thread. Shared memory is shared between threads of the same block, as in the reduction sketch above. Global memory is the slowest, but may be accessed from any thread.

Memory allocation. Oh, no! pointers! The CUDA unified memory model allows sharing between the host and the device. This is how to allocate global memory on the device:

int N = 1024;
float *data;
cudaMallocManaged(&data, N*sizeof(float));
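For comparison (a sketch of the classic pattern, not used in the rest of this example), without unified memory the host and device keep separate copies and data must be moved explicitly:

int N = 1024;
float *h_data = (float *)malloc(N*sizeof(float));   // host copy
float *d_data;
cudaMalloc((void **)&d_data, N*sizeof(float));      // device copy
// ... launch kernels that write d_data ...
cudaMemcpy(h_data, d_data, N*sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_data);
free(h_data);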

Kernel declaration. Oh, no! weird syntax!

// Generate an equally spaced vector from a to b
__global__ void linspace(int N, float *data, float a, float b) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) data[i] = a + (b - a)*i/(N - 1);  // guard against extra threads
}

// Compute the square of each array element
__global__ void mapsquare(int N, float *data) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) data[i] = data[i]*data[i];
}

Kernel invocation. Oh, no! even weirder syntax!

// Define grid and block dimensions; (N + 512 - 1)/512 rounds up
// so that all N elements are covered
dim3 block(512);
dim3 grid((N + 512 - 1)/512);
// Kernel invocation
linspace<<<grid, block>>>(N, data, -10, 10);
mapsquare<<<grid, block>>>(N, data);

Our data still lives in managed memory; we must wait for the device to finish before reading it on the host.

// Synchronize before accessing managed data from the host
cudaDeviceSynchronize();
// Print the result
for (int i = 0; i < N; i++) {
    printf("%f\n", data[i]);
}
cudaFree(data);   // release the managed allocation
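Putting the fragments together (the same pieces as above, assembled into one file; built with, e.g., nvcc example.cu):

#include <cstdio>

__global__ void linspace(int N, float *data, float a, float b) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) data[i] = a + (b - a)*i/(N - 1);
}

__global__ void mapsquare(int N, float *data) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) data[i] = data[i]*data[i];
}

int main() {
    int N = 1024;
    float *data;
    cudaMallocManaged(&data, N*sizeof(float));
    dim3 block(512);
    dim3 grid((N + 512 - 1)/512);
    linspace<<<grid, block>>>(N, data, -10, 10);  // fill data with [-10, 10]
    mapsquare<<<grid, block>>>(N, data);          // square each element in place
    cudaDeviceSynchronize();                      // wait before host access
    for (int i = 0; i < N; i++) printf("%f\n", data[i]);
    cudaFree(data);
    return 0;
}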

Conclusion. Efficient parallel programming may seem very complicated at first, but its benefits are truly worth the effort. If you really liked it, enroll in the Introduction to Parallel Programming course on Udacity.

References
Pointers - C++ Tutorials. http://www.cplusplus.com/doc/tutorial/pointers/ (accessed 10/11/2016).
Unified Memory in CUDA 6, NVIDIA Parallel Forall. https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/ (accessed 10/13/2016).
Programming Guide, CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/cuda-c-programming-guide/ (accessed 10/13/2016).