Lessons learned from a simple application

Computation to Core Mapping: Lessons learned from a simple application

A Simple Application: Matrix Multiplication
Matrix multiplication is used as an example throughout the course. Goal for today: show the concept of Computation-to-Core Mapping, covering block scheduling, occupancy, and thread scheduling. Assumptions: deal with square matrices for simplicity, and leave memory issues for later, using only global memory and local registers.

The algorithm and CPU code
P = M * N, with all matrices of size Width x Width.

    // Matrix multiplication on the (CPU) host in double precision
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                double sum = 0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[i * Width + k];
                    double b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }

The algorithm and CPU code (repeated for emphasis)
Pay attention to the innermost loop over k: it forms the dot product of one row of M with one column of N to produce a single element of P. This is the finest unit of work, and it is what we will map onto a CUDA thread.

First Mapping Scheme
Thread mapping: define the finest computational unit and map it onto each thread. The main criterion is that there be no dependency between units. In our first scheme the unit is the calculation of one element of P, so thread (tx, ty) computes P(tx, ty).
Block mapping: simple, a single block.

Step 1: Memory layout
Elements are named M(column#, row#). The 2D matrix

    M0,0 M1,0 M2,0 M3,0
    M0,1 M1,1 M2,1 M3,1
    M0,2 M1,2 M2,2 M3,2
    M0,3 M1,3 M2,3 M3,3

is laid out in memory as one linear, row-major array:

    M0,0 M1,0 M2,0 M3,0  M0,1 M1,1 M2,1 M3,1  M0,2 M1,2 M2,2 M3,2  M0,3 M1,3 M2,3 M3,3
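
A minimal sketch of this indexing rule (the helper name elem is ours, used only for illustration): the element in column x and row y of a Width-wide matrix lives at offset y * Width + x.

    // Row-major linearization: M(column x, row y) -> M[y * Width + x]
    float elem(const float* M, int Width, int x, int y)
    {
        return M[y * Width + x];
    }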

Step 2: Input Matrix Data Transfer (Host Code)

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;

        // 1. Allocate and load M, N to device memory
        cudaMalloc((void**)&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate P on the device
        cudaMalloc((void**)&Pd, size);

Step 3: Output Matrix Data Transfer (Host Code)

        // 2. Kernel invocation code, shown later

        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(Md);
        cudaFree(Nd);
        cudaFree(Pd);
    }

Step 4: Kernel Function

    // Matrix multiplication kernel - per-thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;

Step 4: Kernel Function (cont.)

        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }

Step 5: Kernel Invocation (Host Code)

    // Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
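
Putting Steps 2 through 5 together, a minimal usage sketch of the host side might look as follows. The test harness, the all-ones/all-twos inputs, and the expected value 8.0 are our own illustrative choices, not from the slides; error checking is omitted.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int Width = 4;                        // small enough for one block
        int size = Width * Width * sizeof(float);
        float *M = (float*)malloc(size);
        float *N = (float*)malloc(size);
        float *P = (float*)malloc(size);

        for (int i = 0; i < Width * Width; ++i) { M[i] = 1.0f; N[i] = 2.0f; }

        MatrixMulOnDevice(M, N, P, Width);          // every element of P should be 8.0

        for (int i = 0; i < Width; ++i) {
            for (int j = 0; j < Width; ++j)
                printf("%6.1f ", P[i * Width + j]);
            printf("\n");
        }

        free(M); free(N); free(P);
        return 0;
    }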

Issues with the First Mapping Scheme
One block of threads computes the whole matrix Pd; the other multiprocessors are not used. (Figure: a grid containing a single block; thread (2, 2) computes one element of Pd as the dot product of an Md row (3, 2, 5, 4) and an Nd column (2, 4, 2, 6), giving 48.)

Issues with the First Mapping Scheme
Each thread loads a row of matrix Md and a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements. The compute to off-chip memory access ratio is therefore close to 1:1, which is not very high.
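
To make that ratio concrete: each iteration of the inner loop issues two global-memory loads (one element of Md, one of Nd) and two floating-point operations (one multiply, one add), so

    compute-to-memory ratio = 2 FLOPs / 2 loads = 1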

Issues with the First Mapping Scheme
The size of the matrix is limited by the number of threads allowed in a thread block. With a maximum of 512 threads per block, we can only do a 22 x 22 matrix (22 * 22 = 484 <= 512). You can put a loop around the kernel call for cases when Width > 22, but multiple kernel launches will cost you.
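
A minimal sketch of what such a loop might look like; the MatrixMulTileKernel variant and its offset parameters are our own illustration, not from the slides.

    // Hypothetical kernel variant that computes only one T x T tile of Pd,
    // starting at (rowOffset, colOffset), so it can be launched once per tile.
    __global__ void MatrixMulTileKernel(float* Md, float* Nd, float* Pd,
                                        int Width, int rowOffset, int colOffset)
    {
        int row = rowOffset + threadIdx.y;
        int col = colOffset + threadIdx.x;
        if (row >= Width || col >= Width) return;   // guard the last, partial tile

        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[row * Width + k] * Nd[k * Width + col];
        Pd[row * Width + col] = Pvalue;
    }

    // Host side: one single-block launch per 22 x 22 output tile.
    // Correct, but the many small launches (each using only one block)
    // are exactly the cost the slide warns about.
    const int T = 22;
    dim3 dimBlock(T, T);
    for (int r = 0; r < Width; r += T)
        for (int c = 0; c < Width; c += T)
            MatrixMulTileKernel<<<1, dimBlock>>>(Md, Nd, Pd, Width, r, c);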

Solution: the Second Mapping Scheme
Thread mapping: the same as in the first scheme (one thread per element of Pd).
Block mapping: create 2D thread blocks, each of which computes a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix; each block has (TILE_WIDTH)^2 threads. Generate a 2D grid of (Width/TILE_WIDTH)^2 blocks.

About the Second Mapping
More blocks, (Width/TILE_WIDTH)^2 of them, support a larger matrix: the maximum size of each dimension of a grid of thread blocks is 65535, so the maximum Width is 65535 x TILE_WIDTH. More blocks also use more multiprocessors.
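
For example, assuming TILE_WIDTH = 16 (the choice is ours; it matches the 16x16 block size discussed in the occupancy section below):

    Max Width = 65535 x 16 = 1,048,560 elements per side

versus only 22 with the single-block scheme.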

Algorithm concept using tiles
Break Pd up into tiles. Each block calculates one tile; each thread calculates one element; the block size equals the tile size. (Figure: Pd partitioned into TILE_WIDTH x TILE_WIDTH sub-matrices Pdsub, addressed by block indices (bx, by) and thread indices (tx, ty).)

Example
With TILE_WIDTH = 2, the 4 x 4 result matrix P is covered by four 2 x 2 blocks:

    Block(0,0): P0,0 P1,0 / P0,1 P1,1      Block(1,0): P2,0 P3,0 / P2,1 P3,1
    Block(0,1): P0,2 P1,2 / P0,3 P1,3      Block(1,1): P2,2 P3,2 / P2,3 P3,3

Block Computation
(Figure: to compute its 2 x 2 tile of Pd, Block(0,0) reads the first two rows of Md, Md0,0..Md3,0 and Md0,1..Md3,1, and the first two columns of Nd, Nd0,0..Nd0,3 and Nd1,0..Nd1,3.)

Kernel Code using Tiles

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Calculate the row index of the Pd element (and of M)
        int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        // Calculate the column index of Pd (and of N)
        int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        float Pvalue = 0;
        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
        Pd[Row * Width + Col] = Pvalue;
    }
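
A quick sanity check of the index arithmetic (our own example, with TILE_WIDTH = 2 as in the 4 x 4 tiling above): the thread with threadIdx = (0, 1) in block blockIdx = (1, 1) computes

    Row = blockIdx.y * TILE_WIDTH + threadIdx.y = 1 * 2 + 1 = 3
    Col = blockIdx.x * TILE_WIDTH + threadIdx.x = 1 * 2 + 0 = 2

that is, the element P2,3, which indeed belongs to Block(1,1) in the tiling shown earlier.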

Revised Kernel Invocation (Host Code)

    // Set up the execution configuration
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Questions?
For matrix multiplication using multiple blocks, should I use 8x8, 16x16 or 32x32 blocks? Why?

Block Scheduling
Up to 8 blocks are assigned to each SM, as resources allow. An SM in G80 can take up to 768 threads: that could be 256 threads/block * 3 blocks, or 128 threads/block * 6 blocks, etc. An SM in GT200 can take up to 1024 threads. (Figure: blocks of threads t0..tm queued on SM 0 and SM 1, each SM with its MT issue unit, SPs, and shared memory.)

Thread Scheduling in a Multiprocessor
Each block is executed as 32-thread warps. If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 * 3 = 24 warps. (Figure: the warps of each block inside a streaming multiprocessor with its instruction L1 cache, instruction fetch/dispatch unit, shared memory, SPs, and SFUs.)

Occupancy of a Multiprocessor
Occupancy measures how fully a multiprocessor is occupied: Occupancy = actual resident warps / maximum allowed warps. A GT200 SM allows 32 warps; a G80 SM allows 24 warps. For example, with the GT200 limit of 32 warps: one block per SM with 32 threads per block gives (32/32) / 32 = 3.125% (very bad); 4 blocks per SM with 256 threads per block gives (256/32) * 4 / 32 = 100% (very good).
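
A tiny helper to make the formula concrete. The function and its parameter names are ours; it considers only the warp-count limit, ignoring registers and shared memory.

    // Occupancy = resident warps / maximum warps per SM (warp-count limit only).
    float occupancy(int threadsPerBlock, int blocksPerSM, int maxWarpsPerSM)
    {
        int warpsPerBlock = (threadsPerBlock + 31) / 32;   // round up to whole warps
        return (float)(warpsPerBlock * blocksPerSM) / maxWarpsPerSM;
    }

    // occupancy(32, 1, 32)  == 0.03125  -> 3.125%, the "very bad" case above
    // occupancy(256, 4, 32) == 1.0      -> 100%,   the "very good" case above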

CUDA Occupancy Calculator
There are three limiting factors: the maximum number of warps, the maximum register usage, and the maximum shared memory usage. (Demo.)

Answers to Our Questions
For matrix multiplication using multiple blocks, should I use 8x8, 16x16 or 32x32 blocks? For a G80 GPU:
For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks; however, each SM can only take up to 8 blocks, so only 512 threads go into each SM (occupancy = 512/768, about 66.6%).
For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take 3 blocks and reach full capacity, unless other resource considerations overrule (occupancy = 100%).
For 32x32, we have 1024 threads per block. Not even one block fits into an SM (not supported).

Answers to Our Questions (cont.)
For matrix multiplication using multiple blocks, should I use 8x8, 16x16 or 32x32 blocks? For a GT200 GPU:
For 8x8, we have 64 threads per block. Since each SM can take up to 1024 threads, that would be 16 blocks; however, each SM can only take up to 8 blocks, so only 512 threads go into each SM (occupancy = 50%).
For 16x16, we have 256 threads per block. Each SM takes 4 blocks and reaches full capacity, unless other resource considerations overrule (occupancy = 100%).
For 32x32, we have 1024 threads per block. Each SM takes 1 block and reaches full capacity, unless other resource considerations overrule (occupancy = 100%).
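
A small sketch that reproduces this reasoning. The helper is our own; it models only the per-SM thread, block, and warp limits discussed on these slides, not registers or shared memory.

    #include <stdio.h>

    // Occupancy from the per-SM limits above (thread, block, and warp counts only).
    float warpOccupancy(int threadsPerBlock, int maxThreadsPerSM,
                        int maxBlocksPerSM, int maxWarpsPerSM)
    {
        int blocks = maxThreadsPerSM / threadsPerBlock;        // limited by thread count
        if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM;  // limited by block slots
        if (blocks == 0) return 0.0f;                          // block does not fit at all
        int warps = blocks * (threadsPerBlock / 32);
        return (float)warps / maxWarpsPerSM;
    }

    int main(void)
    {
        int sizes[] = { 8, 16, 32 };
        for (int i = 0; i < 3; ++i) {
            int t = sizes[i] * sizes[i];   // threads per block
            printf("%2dx%-2d  G80: %5.1f%%   GT200: %5.1f%%\n", sizes[i], sizes[i],
                   100.0f * warpOccupancy(t,  768, 8, 24),
                   100.0f * warpOccupancy(t, 1024, 8, 32));
        }
        return 0;
    }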

Computation-to-Core Mapping
Step 1: Define your computational unit and map each unit to a thread. Avoid dependency; increase the compute to memory access ratio.
Step 2: Group your threads into blocks. Eliminate hardware limits; take advantage of shared memory.