Lecture 3. Programming with GPUs
Scott B. Baden / CSE 262 / Spring 2011


Announcements
GPU access
  lilliput: Tesla C1060 (4 devices)
  cseclass0{1,2}: Fermi GTX 570 (1 device each)
MPI
  Trestles @ SDSC
  Kraken @ NICS

Today's lecture
  Vector processing (SIMD)
  Introduction to stream processing
  Graphical Processing Units

Vector processing

Streaming SIMD Extensions
en.wikipedia.org/wiki/Streaming_SIMD_Extensions
SSE (SSE4 on Intel Nehalem), AltiVec
Short vectors: up to 256 bits
  for i = 0:3 { c[i] = a[i] * b[i]; }
[Figure: lanes a0..a3 and b0..b3 combined element-wise into c in a single SSE instruction]
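
The same four-wide operation can be written directly with SSE intrinsics. The sketch below is illustrative only (it is not from the slides) and assumes single-precision data and the _mm_* intrinsics declared in <xmmintrin.h>:

  #include <xmmintrin.h>   /* SSE intrinsics */

  /* Multiply four floats at a time: c[i] = a[i] * b[i] */
  void mul4(const float *a, const float *b, float *c, int n) {
      for (int i = 0; i + 4 <= n; i += 4) {
          __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats (unaligned) */
          __m128 vb = _mm_loadu_ps(&b[i]);
          __m128 vc = _mm_mul_ps(va, vb);    /* 4 multiplies in one instruction */
          _mm_storeu_ps(&c[i], vc);          /* store 4 results */
      }
  }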

More about streaming extensions
Fused multiply-add
Memory accesses must be contiguous and aligned
How do we sum the values in a vector?
  r[0:3] = c[0:3] + a[0:3]*b[0:3]
[Figure: fused multiply-add across four lanes: r = c + a*b. Courtesy of Mercury Computer Systems, Inc.]
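
Summing across the lanes of a vector register (a horizontal sum) needs extra shuffles or horizontal-add instructions; the sketch below is not from the slides and assumes SSE3's _mm_hadd_ps is available:

  #include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

  /* Horizontal sum of the four lanes of a vector register */
  float hsum4(__m128 v) {
      v = _mm_hadd_ps(v, v);    /* (v0+v1, v2+v3, v0+v1, v2+v3) */
      v = _mm_hadd_ps(v, v);    /* every lane now holds v0+v1+v2+v3 */
      return _mm_cvtss_f32(v);  /* extract lane 0 */
  }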

Memory interleaving
Compensates for slow memory access times
Assume we are accessing memory consecutively
What happens if the stride = number of banks?
[Figure: words 0..19 interleaved across memory banks. Ruye Wang, fourier.eng.hmc.edu/e85/lectures/memory]
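
A small illustration (not from the slides) of why that stride is the worst case: with nbanks interleaved banks, word i lives in bank i % nbanks, so unit stride cycles through all the banks while a stride equal to the number of banks hits the same bank on every access and serializes:

  #include <stdio.h>

  int main(void) {
      const int nbanks = 8;
      int strides[2] = { 1, 8 };
      for (int s = 0; s < 2; s++) {
          printf("stride %d touches banks:", strides[s]);
          for (int i = 0; i < 8; i++)
              printf(" %d", (i * strides[s]) % nbanks);   /* bank = word index mod nbanks */
          printf("\n");
      }
      return 0;
  }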

How do we use the SSE instructions?
Low level: assembly language or libraries
Higher level: a vectorizing compiler
  /opt/intel/bin/icpc -O2 -vec-report3 src.c
  float a[n], b[n], c[n];
  for (int i=0; i<n; i++) a[i] = b[i] + c[i];
  t2a.c(19): (col. 3) remark: LOOP WAS VECTORIZED.
Intel Xeon E5504 Gainestown @ 2.00GHz (lilliput)
Double precision
  With vectorization:    1.48 sec. [0.180 Gflops/s]
  Without vectorization: 2.96 sec. [0.090 Gflops/s]
Single precision
  With vectorization:    0.574 Gflops/s
  Without vectorization: 0.142 Gflops/s

How does the vectorizer work?
Original code
  for (int i=0; i<n; i++) a[i] = b[i] + c[i];
Transformed code
  for (i = 0; i < 1024; i+=4)
    a[i:i+3] = b[i:i+3] + c[i:i+3];
Vector instructions
  for (i = 0; i < 1024; i+=4) {
    vb = vec_ld( &b[i] );
    vc = vec_ld( &c[i] );
    va = vec_add( vb, vc );
    vec_st( va, &a[i] );
  }
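
The vec_ld/vec_add/vec_st names on the slide follow the AltiVec style; on an SSE machine the generated code corresponds roughly to the intrinsics below. This is a sketch for illustration (not from the slides), assuming n is a multiple of 4 and the arrays are 16-byte aligned:

  #include <xmmintrin.h>

  void vadd(float *a, const float *b, const float *c, int n) {
      for (int i = 0; i < n; i += 4) {
          __m128 vb = _mm_load_ps(&b[i]);   /* aligned 4-wide load */
          __m128 vc = _mm_load_ps(&c[i]);
          __m128 va = _mm_add_ps(vb, vc);   /* four adds in one instruction */
          _mm_store_ps(&a[i], va);          /* aligned 4-wide store */
      }
  }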

Data dependencies prevent vectorization
Data dependencies
  for (int i = 1; i < N; i++) b[i] = b[i-1] + 2;
    b[1] = b[0] + 2;
    b[2] = b[1] + 2;
    b[3] = b[2] + 2;
  flow.c(6): warning #592: variable "b" is used before its value is set
    b[i] = b[i-1] + 2;  /* data dependence cycle */
  flow.c(5): (col. 1) remark: loop was not vectorized: existence of vector dependence.
  flow.c(6): (col. 2) remark: vector dependence: assumed FLOW dependence between b line 6 and b line 6.
But note different output from the C++ compiler:
  flow.c(6): warning #592: variable "b" is used before its value is set
    b[i] = b[i-1] + 2;

What prevents vectorization
Interrupted flow out of the loop
  for (i=0; i<n; i++) {
    a[i] = b[i] + c[i];
    maxval = (a[i] > maxval ? a[i] : maxval);
    if (maxval > 1000.0) break;
  }
  t2mx.c(13): (col. 5) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
This loop will vectorize
  for (i=0; i<n; i++) {
    a[i] = b[i] + c[i];
    maxval = (a[i] > maxval ? a[i] : maxval);
  }

Compiler makes conservative decisions
  void copy(char *p, char *q, int n) {   // $PUB/Examples/Vectorization
    int i;
    for (i = 0; i < n; i++) p[i] = q[i];
  }
  copy.c(3): (col. 3) remark: loop was not vectorized: not inner loop.
  copy.c(3): (col. 3) remark: loop was not vectorized: existence of vector dependence.
  copy.c(3): (col. 27) remark: vector dependence: assumed ANTI dependence between q line 3 and p line 3.
  copy.c(3): (col. 27) remark: vector dependence: assumed FLOW dependence between p line 3 and q line 3.
  copy.c(3): (col. 27) remark: vector dependence: assumed FLOW dependence between p line 3 and q line 3.
  copy.c(3): (col. 27) remark: vector dependence: assumed ANTI dependence between q line 3 and p line 3.

Run-time data dependence testing
The restrict keyword is needed to ensure correct semantics
  http://www.devx.com/tips/tip/13825
  "During the scope of the pointer declaration, all data accessed through it will not be accessed through any other pointer; [thus] a given object cannot be changed through another pointer."
  /opt/intel/bin/icpc -O2 -vec-report3 -restrict t2a.c
  void copy_conserve(char *restrict p, char *restrict q, int n) {
    int i;
    if (p+n < q || q+n < p)
      #pragma ivdep
      for (i = 0; i < n; i++) p[i] = q[i];  /* vector loop */
    else
      for (i = 0; i < n; i++) p[i] = q[i];  /* serial loop */
  }
  copy.c(11): (col. 3) remark: LOOP WAS VECTORIZED.

Restrictions on vectorization
Inner loops only
  for (int j=0; j<reps; j++)
    for (int i=0; i<n; i++)
      a[i] = b[i] + c[i];
  t2av.cpp(95): (col. 7) remark: loop was not vectorized: not inner loop.

Unaligned data movement is expensive
Accesses aligned on 16-byte boundaries go faster
The Intel compiler can handle some alignments
  http://drdobbs.com/cpp/184401611
  foo(double *a, double *b, int N);
  for (int i = 1; i < N-1; i++) a[i+1] = b[i] * 3;   // cannot be vectorized
  [Figure: relative alignment of a[0]..a[3] and b[0]..b[3]]
  for (int i = 2; i < N-1; i++) a[i+1] = b[i] * 3;
Alignment via loop peeling
  void fill(char *x) {
    for (int i = 0; i < 1024; i++) x[i] = 1;
  }
  peel = x & 0x0f;
  if (peel != 0) {
    peel = 16 - peel;
    for (i = 0; i < peel; i++) x[i] = 1;
  }
  /* aligned access */
  for (i = peel; i < 1024; i++) x[i] = 1;
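
One way to avoid the peel loop altogether is to request aligned storage up front. The snippet below is a sketch (not from the slides) that uses _mm_malloc/_mm_free to obtain 16-byte-aligned arrays so that unit-stride loops start on an aligned boundary:

  #include <xmmintrin.h>

  int main(void) {
      const int n = 1024;
      /* 16-byte-aligned allocations: vector loads/stores need no peeling */
      float *a = (float *) _mm_malloc(n * sizeof(float), 16);
      float *b = (float *) _mm_malloc(n * sizeof(float), 16);
      for (int i = 0; i < n; i++) b[i] = (float) i;
      for (int i = 0; i < n; i++) a[i] = b[i] * 3.0f;   /* aligned, unit stride */
      _mm_free(a);
      _mm_free(b);
      return 0;
  }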

Computing with Graphical Processing Units (GPUs)

Heterogeneous processing
Two types of processors: general purpose + accelerator
The accelerator can perform certain tasks more quickly, subject to various overhead costs
The accelerator amplifies the relative cost of communication

NVIDIA GeForce GTX 280
Hierarchically organized clusters of streaming multiprocessors
240 cores @ 1.296 GHz
Peak performance 933.12 Gflops/s
SIMT parallelism
1 GB device memory (frame buffer)
512-bit memory interface @ 132 GB/s
Transistor counts: GTX 280: 1.4B; Intel Penryn: 410M (dual core)

Streaming processor cluster
GTX-280 GPU: 10 clusters @ 3 streaming multiprocessors (vector cores)
Each vector core
  8 scalar cores: fused multiply-adder + multiplier (32 bits), truncated intermediate result
  Shared memory (16 KB) and registers (64 KB)
  64-bit fused multiply-adder + 2 super function units (2 fused multiply-adders)
  1 FMA + 1 multiply per cycle = 3 flops/cycle/core * 240 cores = 720 flops/cycle @ 1.296 GHz: 933 GFLOPS
[Figure: SM block diagram with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs and 2 SFUs. David Kirk/NVIDIA and Wen-mei Hwu/UIUC]

Streaming Multiprocessor
[Figure: streaming multiprocessor block diagram, H. Goto]

CUDA
Acceleration model
  Under control of the host, invoke sequences of multithreaded kernels on the device (GPU)
  Many lightweight threads
CUDA: programming environment + C extensions
  KernelA<<<4,8>>>
  KernelB<<<4,8>>>
  KernelC<<<4,8>>>
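
A minimal sketch of this model (kernel names and sizes are hypothetical, not from the course code): the host launches a sequence of kernels, each a grid of 4 blocks of 8 threads, and kernels issued to the same stream execute in order on the device.

  #include <cuda_runtime.h>

  __global__ void KernelA(float *d) { /* ... device work ... */ }
  __global__ void KernelB(float *d) { /* ... device work ... */ }

  int main(void) {
      float *d_data;
      cudaMalloc((void **) &d_data, 32 * sizeof(float));
      KernelA<<<4, 8>>>(d_data);     // 4 blocks of 8 threads
      KernelB<<<4, 8>>>(d_data);     // runs after KernelA completes
      cudaDeviceSynchronize();       // host waits for the device
      cudaFree(d_data);
      return 0;
  }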

Thread execution model
Kernel call spawns virtualized, hierarchically organized threads: Grid > Block > Thread
Specify the number and geometry of threads in a block, and similarly for blocks
Thread blocks
  Subdivide a global index domain
  Cooperate and synchronize, with access to fast on-chip shared memory
  Threads in different blocks communicate only through slow global memory
Threads are lightweight
  Assigned to an SM in units of blocks
  Compiler re-arranges loads to hide latencies
Global synchronization: kernel invocation
  KernelA<<<(2,3), (3,5)>>>   (a 2 x 3 grid of 3 x 5 thread blocks)
[Figure: host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks (x, y) and each block contains threads (x, y). David Kirk/NVIDIA & Wen-mei Hwu/UIUC]

Threads, blocks, grids and data
All threads execute the same instruction (SIMT)
A block may have a different number of dimensions (1D, 2D or 3D) than a grid (1D/2D)
Each thread is uniquely specified by its block & thread ID
Programmer determines the mapping of virtual thread IDs to global memory locations: Z^n -> Z^2 x Z^3, i.e. θ(Π_i), Π_i ∈ Π
[Figure: threads mapped onto regions of global memory]
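
For example (a sketch, not from the slides; names are hypothetical), a kernel over a 2D grid of 2D blocks can map each thread's (row, col) ID onto a row-major location in global memory:

  __global__ void addOne2D(float *a, int rows, int cols) {
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      if (row < rows && col < cols)        // guard threads outside the domain
          a[row * cols + col] += 1.0f;     // 2D thread ID -> 1D global memory index
  }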

Memory Hierarchy (Device)
[Figure: device memory hierarchy - per-thread registers and local memory, per-block shared memory, and device-wide global, constant and texture memory accessible from the host. Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC]
  Name          Latency (cycles)   Cached
  Global DRAM   100s               No
  Local DRAM    100s               No
  Constant      1s - 10s - 100s    Yes
  Texture       1s - 10s - 100s    Yes
  Shared        1                  --
  Register      1                  --

Run time
void __syncthreads()
  Synchronizes all threads in a block
  Threads wait until all have arrived
  Avoids memory hazards when accessing shared or global memory
Beware of conditionals
  if (condition(x)) __syncthreads();
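
As a sketch of correct usage (not from the slides; the kernel is hypothetical and assumes n is a multiple of blockDim.x), every thread of the block stages a value into shared memory and reaches the same barrier before any thread reads a neighbor's value:

  __global__ void reverseBlock(float *d, int n) {
      extern __shared__ float tile[];            // per-block shared memory, sized at launch
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      tile[threadIdx.x] = d[idx];                // stage into shared memory
      __syncthreads();                           // all threads of the block reach this barrier
      d[idx] = tile[blockDim.x - 1 - threadIdx.x];
  }

A launch such as reverseBlock<<<nBlocks, bSize, bSize * sizeof(float)>>>(a_d, n) supplies the shared-memory size as the third configuration parameter.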

Coding example: Increment Array
Serial code
  void incrementArrayOnHost(float *a, int N) {
    int i;
    for (i=0; i < N; i++) a[i] = a[i]+1.f;
  }
CUDA code
  #include <cuda.h>
  __global__ void incrementOnDevice(float *a, int N) {
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx]+1.f;
  }
  incrementOnDevice <<< nBlocks, blockSize >>> (a_d, N);
Rob Farber, Dr Dobb's Journal

Managing memory
  float *a_h, *b_h;   // pointers to host memory
  float *a_d;         // pointer to device memory
  cudaMalloc((void **) &a_d, size);
  for (i=0; i<N; i++) a_h[i] = (float)i;   // init host data
  cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);

Computing and returning result
  int bSize = 4;
  int nBlocks = N/bSize + (N%bSize == 0 ? 0 : 1);
  incrementOnDevice <<< nBlocks, bSize >>> (a_d, N);
  // Retrieve result from device and store in b_h
  cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // check results
  for (i=0; i<N; i++) assert(a_h[i] == b_h[i]);
  // cleanup
  free(a_h); free(b_h);
  cudaFree(a_d);
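
Pulling the fragments together, a self-contained version of the increment-array example (a sketch in the spirit of Rob Farber's code; the course handout may differ in details) might look like this:

  #include <cuda.h>
  #include <assert.h>
  #include <stdlib.h>

  void incrementArrayOnHost(float *a, int N) {
      for (int i = 0; i < N; i++) a[i] = a[i] + 1.f;
  }

  __global__ void incrementOnDevice(float *a, int N) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N) a[idx] = a[idx] + 1.f;
  }

  int main(void) {
      int N = 1 << 20;
      size_t size = N * sizeof(float);
      float *a_h = (float *) malloc(size);   // host input / reference
      float *b_h = (float *) malloc(size);   // host copy of device result
      float *a_d;                            // device array
      cudaMalloc((void **) &a_d, size);

      for (int i = 0; i < N; i++) a_h[i] = (float) i;         // init host data
      cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

      int bSize = 256;
      int nBlocks = N / bSize + (N % bSize == 0 ? 0 : 1);
      incrementOnDevice<<<nBlocks, bSize>>>(a_d, N);

      cudaMemcpy(b_h, a_d, size, cudaMemcpyDeviceToHost);      // device result
      incrementArrayOnHost(a_h, N);                            // reference result
      for (int i = 0; i < N; i++) assert(a_h[i] == b_h[i]);    // verify

      free(a_h); free(b_h); cudaFree(a_d);
      return 0;
  }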

CUDA language extensions
Type qualifiers to declare device kernel functions
  __global__ void matrixMul( )
Kernel launch syntax
  matrixMul<<< grid, threads >>>( )
Keywords
  blockIdx, threadIdx, blockDim, gridDim
Runtime, e.g. storage allocation
  cudaMalloc, cudaFree, cudaMemcpy

Configuration variables
Types to manage thread geometries
  dim3 gridDim, blockDim
    Dimensions of the grid in blocks (gridDim.z not used)
    Dimensions of a thread block in threads
  dim3 blockIdx, threadIdx;
    Block index within the grid
    Thread index within the block
Example
  __global__ void KernelFunc(...);
  dim3 DimGrid(40, 30);     // 1200 thread blocks
  dim3 DimBlock(4, 8, 16);  // 512 threads per block
  KernelFunc<<< DimGrid, DimBlock >>>(...);
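
As a sketch of putting the configuration variables to work (not from the slides; names are hypothetical), the host picks a dim3 block shape and rounds the grid up so it covers an arbitrary 2D problem size, while the kernel recovers its coordinates from blockIdx, blockDim and threadIdx:

  __global__ void setOnes(float *a, int rows, int cols) {
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      if (row < rows && col < cols)
          a[row * cols + col] = 1.0f;
  }

  void launchSetOnes(float *a_d, int rows, int cols) {
      dim3 dimBlock(16, 16);                                 // 256 threads per block
      dim3 dimGrid((cols + dimBlock.x - 1) / dimBlock.x,     // round up so the grid
                   (rows + dimBlock.y - 1) / dimBlock.y);    // covers the whole matrix
      setOnes<<<dimGrid, dimBlock>>>(a_d, rows, cols);
  }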

For next time
Run the provided increment array benchmark
  Report performance on Lilliput for various values of N, # timesteps, and thread block sizes
  What settings minimized the running time per point?
  Compare performance with the host
Next, change the increment function to the sin() function, scaling the input array to cover the range [0:2π)
  Why are the results different on the host and device?
  Comment out the verification code and report performance
  Add a loop to repeat the computation several times; account for what you observe
Be prepared to discuss your results in class next time
Code is found in lilliput:$PUB/examples/cuda/incrarr (PUB = /class/public/cse262-sp11)

Assignment
Starting from a basic (unoptimized) CUDA implementation
  Optimize matrix multiplication for the GPU
    Fermi and 200 series
  Next, implement matrix multiplication in MPI
    Lilliput
    Trestles
    If time, Kraken
  You'll be given an efficient (blocked for cache) serial implementation
Goals
  Understand differences between Fermi and 200 series GPUs
  Comparative performance across platforms: how many MPI cores are needed to deliver performance comparable to the GPU?