Lecture 6 CSE 260 Parallel Computation (Fall 2015) Scott B. Baden. Computing with Graphical Processing Units CUDA Programming Matrix multiplication


Announcements: A2 has been released: matrix multiplication with CUDA. You'll optimize matrix multiply to reach at least 360 Gflops/sec.

Today's lecture: introduction to GPUs: architecture and programming (CUDA).

Recapping from last time: GPUs process long vectors on thousands of specialized cores, executing 1000s of threads to hide data motion. They assume some regularity in memory accesses and control flow.

Tesla Kepler K20m / K80 (GK110/210): hierarchically organized clusters of streaming multiprocessors
- 13 streaming multiprocessors @ 705 (745) MHz [x2 for K80]
- Peak performance: 1.2 (2.4) Tflops/s double precision, fused multiply/add
- 5 (11¼) GB device memory (frame buffer) @ 208 (240) GB/s
- Stampede has compute capability 3.5, Sorken 3.7
- 64K (128K) 32-bit registers
- 7.1B transistors (Nvidia)

Overview of Kepler GK110 (figure: GK110 block diagram)

SMX streaming multiprocessor, AKA vector unit
- Stampede's K20s (GK110 GPU) have 13 SMXs (2496 cores); so do the K80s
- Each vector unit: 192 SP cores, 64 DP cores, 32 SFUs, 32 load/store units
- Each scalar core: fused multiply-adder, truncates the intermediate result
- 64 KB (112 KB) on-chip memory, configurable as scratchpad memory + L1 cache
- 64K (128K) x 32-bit registers (256 (512) KB), up to 255 per thread
- Peak: 1 FMA/cycle = 2 flops/cycle per DP core x 64 DP cores/SMX x 13 SMX = 1664 flops/cycle; at 0.7006 GHz that is 1.165 TFLOPS per processor (2.33 for K80) (Nvidia)


Kepler's memory hierarchy
- DRAM takes hundreds of cycles to access
- The on-chip memory can be partitioned between shared memory and L1 cache: {¾ + ¼}, {¼ + ¾}, or {½ + ½}
- L2 cache (1.5 MB)
(Figure after B. Wilkinson)
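
The shared-memory / L1 split is chosen from the host. A minimal sketch (not from the lecture; it assumes the standard Fermi/Kepler-era runtime call):

    #include <cuda_runtime.h>

    // Ask for the 48 KB shared / 16 KB L1 split; the other options are
    // cudaFuncCachePreferL1 and cudaFuncCachePreferEqual.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

The setting is a hint per device (or per kernel via cudaFuncSetCacheConfig); the runtime may choose a different configuration if a kernel needs more shared memory than the requested split provides.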

CUDA: a programming environment with extensions to C. Under control of the host, we invoke sequences of multithreaded kernels on the device (GPU), each running many lightweight threads, e.g.

    KernelA<<<4,8>>>()
    KernelB<<<4,8>>>()
    KernelC<<<4,8>>>()

Thread execution model
- A kernel call spawns virtualized, hierarchically organized threads: a grid of thread blocks
- Hardware dispatches blocks to cores with zero overhead
- The compiler re-arranges loads to hide latencies
- Global synchronization: kernel invocation
(Figure: a grid of thread blocks over global memory, e.g. KernelA<<<(2,3), (3,5)>>>())
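
There is no device-wide barrier inside a kernel; to synchronize all blocks you end the kernel and launch another. A small sketch (stepA and stepB are hypothetical kernels, not from the lecture):

    // Launches issued to the same (default) stream execute in order, so
    // stepB sees all of stepA's writes to global memory: the kernel
    // boundary is the global synchronization point.
    stepA<<<grid, threads>>>(d_data);
    stepB<<<grid, threads>>>(d_data);
    cudaDeviceSynchronize();   // host waits for both kernels to finish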

Execution configurations
- Expressed with configuration variables; the programmer sets the thread block size and maps threads to memory locations
- Each thread is uniquely specified by its block and thread IDs

    __global__ void Kernel(...);
    dim3 DimGrid(40,30);    // 1200 thread blocks
    dim3 DimBlock(3,5,10);  // 150 threads per block
    Kernel<<< DimGrid, DimBlock >>>(...);

(Figure: device grid of thread blocks; DavidKirk/NVIDIA & Wen-mei Hwu/UIUC)
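
As a sketch of how a thread recovers a unique identity from these variables (a hypothetical kernel, not from the lecture), flattening the IDs for the DimGrid(40,30) / DimBlock(3,5,10) configuration above gives each of the 180,000 threads a distinct global number:

    __global__ void whoAmI(int *out) {
        // Flatten the 2D grid of blocks and the 3D block of threads.
        int blockId       = blockIdx.y * gridDim.x + blockIdx.x;
        int threadsPerBlk = blockDim.x * blockDim.y * blockDim.z;
        int threadInBlock = threadIdx.z * blockDim.y * blockDim.x
                          + threadIdx.y * blockDim.x + threadIdx.x;
        int globalId      = blockId * threadsPerBlk + threadInBlock;
        out[globalId] = globalId;       // 0 .. 1200*150 - 1
    }

    // whoAmI<<< DimGrid, DimBlock >>>(d_out);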

Thread blocks: the unit of workload assignment
- Each thread has its own set of registers
- All threads in a block have access to a fast on-chip shared memory
- Synchronization is only among the threads within a block
- Threads in different blocks communicate via slow global memory
- The processor groups threads into warps of 32 threads
- SIMT parallelism: all threads in a warp execute the same instruction; when a warp diverges, all branches are followed with the inactive threads' instructions disabled, so divergence causes serialization
(Figure: KernelA<<<(2,3), (3,5)>>>(); SMX with MT issue unit, SPs, and shared memory; device grid of thread blocks; DavidKirk/NVIDIA & Wen-mei Hwu/UIUC)

Coding example: increment an array

Serial code:

    void incrementArrayOnHost(float *a, int N) {
        int i;
        for (i = 0; i < N; i++) a[i] = a[i] + 1.f;
    }

CUDA:

    // The programmer determines the mapping of virtual thread IDs
    // to global memory locations
    #include <cuda.h>
    __global__ void incrementOnDevice(float *a, int N) {
        // Each thread is uniquely specified by its block & thread ID
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if (idx < N) a[idx] = a[idx] + 1.f;
    }

    incrementOnDevice<<< nBlocks, blockSize >>>(a_d, N);

(Rob Farber, Dr. Dobb's Journal)

Managing memory: data must be allocated on both the device and the host, and moved between the two explicitly

    float *a_h, *b_h;   // pointers to host memory
    float *a_d;         // pointer to device memory

    cudaMalloc((void **) &a_d, size);
    for (i = 0; i < N; i++) a_h[i] = (float) i;   // init host data
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);

Computing and returning the result

    int bSize = 4;
    int nBlocks = N/bSize + (N % bSize == 0 ? 0 : 1);
    incrementOnDevice<<< nBlocks, bSize >>>(a_d, N);

    // Retrieve the result from the device and store it in b_h
    cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // check results (after repeating the computation on the host)
    incrementArrayOnHost(a_h, N);
    for (i = 0; i < N; i++) assert(a_h[i] == b_h[i]);

    // cleanup
    free(a_h); free(b_h);
    cudaFree(a_d);
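
Putting the last three slides together, a self-contained version of the increment example might look like this sketch (the array size and block size are illustrative; error checking is omitted for brevity):

    #include <assert.h>
    #include <stdlib.h>
    #include <cuda.h>

    __global__ void incrementOnDevice(float *a, int N) {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if (idx < N) a[idx] = a[idx] + 1.f;
    }

    int main() {
        int N = 1 << 20;
        size_t size = N * sizeof(float);
        float *a_h = (float *) malloc(size);   // host input
        float *b_h = (float *) malloc(size);   // host copy of the result
        for (int i = 0; i < N; i++) a_h[i] = (float) i;

        float *a_d;                            // device array
        cudaMalloc((void **) &a_d, size);
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

        int bSize = 128;
        int nBlocks = N/bSize + (N % bSize == 0 ? 0 : 1);
        incrementOnDevice<<< nBlocks, bSize >>>(a_d, N);

        cudaMemcpy(b_h, a_d, size, cudaMemcpyDeviceToHost);
        for (int i = 0; i < N; i++) assert(b_h[i] == a_h[i] + 1.f);

        free(a_h); free(b_h); cudaFree(a_d);
        return 0;
    }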

Experiments: increment benchmark
- Total time: timing taken from the host, includes copying data to the device
- Device only: time taken on the device only
N = 8388480, block size = 128, times in milliseconds unless noted (cseclass02)

    Reps:                              10     100    1000    10^4    10^5
    Device time                        3.3    36     358     3.58s   35.8s
    Kernel launch + data xfer          71.    102    429     3.64s   35.9s
    Host                               92.    730    7.06s   --      --
    a[i] = 1 + sin(a[i])  (Device)     6.8    52     500
    Sine function         (Host)       6.4s   23.9s  200s

What is the cost of moving the data and launching the kernel?
A. About 6.8 ms ((71-3)/10)
B. About 0.66 ms ((102-36)/100)
C. About 0.071 ms ((429-358)/1000)
D. About 68 ms (71-3)

N = 8M, block size = 128, times in milliseconds unless noted

    Reps:                        10     100    1000   10^4    10^5
    Device time (increment)      3.3    36     358    3.58s   35.8s
    Kernel launch + data xfer    71.    102    429    3.64s   35.9s

Naïve implementation of matrix multiply
- Each thread computes one element of C: it loads a row of matrix A, loads a column of matrix B, and computes a dot product
- Every value of A and B is loaded N times from global memory
(Figure: a BLOCK_SIZE block of threads; a row of A (3 2 5 4) times a column of B (2 4 2 6) gives C(2,2) = 48; courtesy DavidKirk/NVIDIA and Wen-mei Hwu/UIUC)

Naïve host implementation ("ijk" ordering: C[i,j] accumulates A[i,:] times B[:,j])

    for i := 0 to n-1
      for j := 0 to n-1
        for k := 0 to n-1
          C[i,j] += A[i,k] * B[k,j]

    for (unsigned int i = 0; i < N; i++)
      for (unsigned int j = 0; j < N; j++) {
        DOUBLE sum = 0;
        for (unsigned int k = 0; k < N; k++)
          sum += A[i * N + k] * B[k * N + j];
        C[i * N + j] = (DOUBLE) sum;
      }

Naïve kernel

    __global__ void matMul(double *C, double *A, double *B) {
        int I = blockIdx.x*blockDim.x + threadIdx.x;
        int J = blockIdx.y*blockDim.y + threadIdx.y;
        int N = blockDim.y*gridDim.y;   // assumes a square matrix
        if ((I < N) && (J < N)) {
            double _c = 0;
            for (unsigned int k = 0; k < N; k++) {
                double a = A[I * N + k];
                double b = B[k * N + J];
                _c += a * b;
            }
            C[I * N + J] = _c;
        }
    }

Serial reference, for comparison:

    for (unsigned int i = 0; i < N; i++)
      for (unsigned int j = 0; j < N; j++) {
        DOUBLE sum = 0;
        for (unsigned int k = 0; k < N; k++)
          sum += A[i * N + k] * B[k * N + j];
        C[i * N + j] = (DOUBLE) sum;
      }

CUDA code on the host side

    unsigned int n2 = N*N*sizeof(DOUBLE);        // generic types defined in types.h
    DOUBLE *h_a = (DOUBLE *) malloc(n2);
    DOUBLE *h_b = (DOUBLE *) malloc(n2);
    // Check that the allocations went OK
    assert(h_a); assert(h_b);
    genMatrix(h_a, N, N); genMatrix(h_b, N, N);  // initialize the matrices

    DOUBLE *d_a, *d_b, *d_c;
    cudaMalloc((void **) &d_a, n2);              // ... likewise for &d_b and &d_c
    checkCudaError("Error allocating device memory arrays");

    // copy host memory to the device
    cudaMemcpy(d_a, h_a, n2, cudaMemcpyHostToDevice);
    checkCudaError("Error copying data to device");
    cudaMemcpy(d_b, h_b, n2, cudaMemcpyHostToDevice);
    checkCudaError("Error copying data to device");

Host code, continued

    // set up the execution configuration
    dim3 threads(ntx, nty, 1);                   // ntx & nty are user input
    dim3 grid(N / threads.x, N / threads.y);

    // launch the kernel
    matMul<<< grid, threads >>>(d_c, d_a, d_b);

    // retrieve the result
    cudaMemcpy(h_c, d_c, n2, cudaMemcpyDeviceToHost);
    checkCudaError("Unable to retrieve result from device");

    // free device storage
    assert(cudaSuccess == cudaFree(d_a));
    assert(cudaSuccess == cudaFree(d_b));
    assert(cudaSuccess == cudaFree(d_c));

Configuration variables: types to manage thread geometries
- dim3 gridDim, blockDim: the dimensions of the grid in blocks, and of a thread block in threads
- dim3 blockIdx, threadIdx: the index of a block within the grid, and of a thread within its block

    __global__ void KernelFunc(...);
    dim3 DimGrid(2,3);     // 6 thread blocks
    dim3 DimBlock(3,5,1);  // 15 threads per block
    KernelFunc<<< DimGrid, DimBlock >>>(...);

(Figure: device grid of thread blocks; DavidKirk/NVIDIA & Wen-mei Hwu/UIUC)

Thread mapping

    __global__ void matMul(double *C, double *A, double *B) {
        int I = blockIdx.x*blockDim.x + threadIdx.x;
        int J = blockIdx.y*blockDim.y + threadIdx.y;
        int N = blockDim.y*gridDim.y;   // square matrix
        if ((I < N) && (J < N)) {
            double _c = 0;
            for (unsigned int k = 0; k < N; k++) {
                double a = A[I * N + k];
                double b = B[k * N + J];
                _c += a * b;
            }
            C[I * N + J] = _c;
        }
    }

(Figure: thread (tx, ty) reads a row of matrix M (A) and a column of matrix N (B), each of length WIDTH, to produce one element of P (C); courtesy DavidKirk/NVIDIA and Wen-mei Hwu/UIUC)

Performance: N = 512, double precision
Stampede: K20 GPU; host Intel Xeon E5-2680 @ 2.70 GHz, peak 21.6 GF/core

GPU (thread-block geometry vs. GFLOPS):

    Geometry   GFLOPS
    1 x 128    64
    2 x 128    76
    4 x 64     49
    16 x 16    16
    1 x 512    63
    2 x 256    78
    4 x 128    50
    1 x 1024   62
    2 x 512    79
    4 x 256    51

Host (# cores vs. GFLOPS):

    # cores    GFLOPS
    1          22
    2          43
    4          84
    8          160

Parallel speedup: how much did our GPU implementation improve over the traditional processor?

Speedup S = (running time of the fastest program on conventional processors) / (running time of the accelerated program)

Baseline: a multithreaded program.
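
A rough worked example using the performance numbers on the previous slide (and assuming the running time scales inversely with sustained GFLOPS for the same N = 512 problem): the naïve GPU kernel has not yet beaten the 8-core host,

    S = \frac{T_{\text{conventional}}}{T_{\text{accelerated}}}
      \approx \frac{1/160\ \text{GF/s}}{1/79\ \text{GF/s}} \approx 0.49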

Measuring performance: two ways
- Use an ordinary timer, e.g. gettimeofday()
- Use CUDA events / elapsed time (#ifdef CUDA_TIMER); see incrArray
Note that kernel invocation is asynchronous:

    cudaDeviceSynchronize();
    double t_device_compute = -getTime();
    incr<<< nBlocks, bSize >>>(a_d, N);
    cudaDeviceSynchronize();
    t_device_compute += getTime();
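
A sketch of the CUDA-event alternative (standard runtime calls; the kernel and launch parameters are the ones from the increment example):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                // enqueue on the default stream
    incr<<< nBlocks, bSize >>>(a_d, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);               // wait until the kernel has finished

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);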

CUDA error handling
Cuda error: Can't run kernel: invalid device function.
- CUDA can fail silently, and you can observe misleading performance, e.g. if you specify invalid grid / thread block dimensions
- Note: the last error can be cleared by successive kernel calls, so check frequently

    cudaMalloc((void **) &a_d, size);
    checkCudaError("Unable to allocate storage on the device");

Consult checkCudaError() in utils.cu (incrArray). What about asynchronous calls? See the CUDA Programming Guide, Error Handling.
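
checkCudaError() is provided in utils.cu; a typical implementation is a thin wrapper around cudaGetLastError(), roughly like this sketch (not the actual course code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Report and abort if the most recent runtime call or kernel launch
    // left an error behind. cudaGetLastError() also clears the error,
    // which is why checks should be frequent.
    void checkCudaError(const char *msg) {
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }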

Getting information about the binary
The compiler reports a kernel's register usage, along with its local, shared, and constant memory usage, when given --ptxas-options=-v

    __global__ void incrementArrays(float *a, int N) {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if (idx < N) a[idx] = a[idx] + 1.f;
    }

    ptxas info : Compiling entry function '_Z6matMuliPdS_S_' for 'sm_37'
    ptxas info : Function properties for _Z6matMuliPdS_S_
        0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info : Used 26 registers, 352 bytes cmem[0]
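
For example, the flag is passed straight to nvcc (the source and binary names here are illustrative):

    nvcc -arch=sm_37 --ptxas-options=-v -o mmpy mmpy.cu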

Next time: how to use shared memory to significantly improve the performance of matrix multiply.