
1 GPU AND MANYCORE

2 CPUs+GPUs
[Block diagram of a two-socket server: each socket has cores with private L1/L2 and a shared L3, a memory controller driving banks (1A-3D), and 40x PCIe lanes; the PCH (PCI, SCSI/SATA, USB, audio, etc.) attaches via DMI2 at 20 Gb/s; the two sockets are joined by 2 QPI links at 2x 19.2 GB/s full duplex; GPUs sit on PCIe x16 slots.]
PCIe: 1 GB/s per lane, so x16 => 16 GB/s full duplex

3 Lookahead: Programming Model? #pragma omp accelerate [clauses...] Structured-block

4 Programming Model
GPU is a multi-processor
One or more processes per GPU?
How to start these processes?
Persistent or fork-join?
Distributed memory w.r.t. CPU cores?
What about memory across GPU cores?
[Figure: CPU cores sharing one memory; each GPU's cores with their own memory.]

5 Programming Model?
Execution model: fork off low-synchronization work-items to each accelerator
Memory model: explicit memory copy to each accelerator; support shared memory

6

7 Volta

8 SM

9 Tesla GPU Generations

|                        | Tesla K40     | Tesla M40     | Tesla P100    | Tesla V100    |
| SMs                    | 15            | 24            | 56            | 80            |
| FP32 Cores/GPU         | 2880          | 3072          | 3584          | 5120          |
| FP64 Cores/GPU         | 960           | 96            | 1792          | 2560          |
| GPU Boost Clock        | 810/875 MHz   | 1114 MHz      | 1480 MHz      | 1462 MHz      |
| Peak FP32 TFLOPS       | 5             | 6.8           | 10.6          | 15            |
| Peak FP64 TFLOPS       | 1.7           | 0.21          | 5.3           | 7.5           |
| Texture Units          | 240           | 192           | 224           | 320           |
| Memory Interface       | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size            | Up to 12 GB   | Up to 24 GB   | 16 GB         | 16 GB         |
| L2 Cache Size          | 1536 KB       | 3072 KB       | 4096 KB       | 6144 KB       |
| Shared Mem Size/SM     | 16/32/48 KB   | 96 KB         | 64 KB         | Up to 96 KB   |
| Register File Size/GPU | 3840 KB       | 6144 KB       | 14336 KB      | 20480 KB      |
| TDP                    | 235 W         | 250 W         | 300 W         | 300 W         |
| Transistors            | 7.1 billion   | 8 billion     | 15.3 billion  | 21.1 billion  |

10 Kepler (Source: NVIDIA)

11 Kepler: SMX (Source: NVIDIA)

12 Pascal P100

13 P100 SM

14 3D Memory 3D Chip-on-Wafer integration 5X bandwidth 2.5X capacity 4X energy efficiency

15 Parallel Memory Read
PCI Express: 16 GB/s
Broadwell memory BW: ~60 GB/s
P100 memory BW: 732 GB/s
NVLink: 20 GB/s (per link)

16 NVLink
Differential signaling with embedded clock
PCIe programming model (w/ DMA+)
Unified Memory (cache coherency in Gen 2.0)
5 to 12x the bandwidth of PCIe

17 Inter-GPU NVLink Source: nvidia, 2016

18 CPU vs GPU architecture
Memory latency needs to be hidden: run many threads
[Figure: CPU devotes ~8 MB to cache; GPU ~64 KB per SM]
The GPU can do this because of its high compute density (Source: NVIDIA)

19 CUDA Architecture (Courtesy NVIDIA)

20 GPU Performance
Massively parallel: a few thousand cores
Low power
Massively threaded: many thousand threads
Hardware-supported threads

21 Programming Model? #pragma omp accelerate [clauses...] Structured-block
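
The accelerate directive above is the lecture's lookahead sketch, not a ratified pragma. For comparison, a minimal sketch of what standard OpenMP (4.0 and later) actually provides via the target construct; scale_add, a, b, and n are illustrative names, not from the slides:

// Offload a loop to an accelerator with explicit data mapping.
void scale_add(int n, float *a, const float *b)
{
    // Map 'a' both ways and 'b' to the device, run the loop there.
    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:n]) map(to: b[0:n])
    for (int i = 0; i < n; i++)
        a[i] = a[i] + 2.0f * b[i];
}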

22 CUDA Programming Model
Co-processor with many cores
CPU code schedules multi-threaded tasks
Message passing and shared memory models
GPU threads are organized hierarchically: grids, blocks, warps
Shared memory
Memory hierarchy opaque
Parallel read/write

23 CUDA is C-like
Integrated host+device app C program
Serial or modestly parallel parts in host C code
Highly parallel parts in device SPMD kernel C code

Serial Code (host)
Parallel Kernel (device): KernelA<<< nblk, ntid >>>(args); ...
Serial Code (host)
Parallel Kernel (device): KernelB<<< nblk, ntid >>>(args);
(Courtesy Kirk & Hwu)
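
To make the slide's pattern concrete, a minimal self-contained sketch in the same KernelA<<<nblk, ntid>>> style; the kernel body and buffer names are assumptions, not from the slides:

#include <cuda_runtime.h>

// Assumed placeholder kernel: each thread doubles one element.
__global__ void KernelA(float *args, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) args[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024, ntid = 256, nblk = (n + ntid - 1) / ntid;
    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));
    // ... serial host code fills d ...
    KernelA<<<nblk, ntid>>>(d, n);   // parallel kernel (device)
    cudaDeviceSynchronize();         // back to serial host code
    cudaFree(d);
    return 0;
}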

24 CUDA Devices and Threads
A compute device:
  Is a coprocessor to the CPU or host
  Has its own DRAM (device memory)
  Runs many threads in parallel
  Is typically a GPU but can also be another type of parallel processing device
Data-parallel portions of an application are expressed as device kernels, which run on many threads
Differences between GPU and CPU threads:
  GPU threads are extremely lightweight: very little creation overhead
  GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
(Courtesy Kirk & Hwu)

25 Extended C
Declspecs: __global__, __device__, __shared__, __local__, __constant__
Built-in variables: threadIdx, blockIdx
Intrinsics: __syncthreads()
Runtime API: memory, symbol, execution management; function launch

__device__ float filter[N];

__global__ void convolve(float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
(Courtesy Kirk & Hwu)

26 Extended-C SW stack
Integrated source (foo.cu) -> nvcc C/C++ frontend
  Device path: OCG -> GPU assembly (foo.s) -> SASS (foo.sass)
  Host path: CPU host code (foo.cpp) -> gcc / cl
Tools: cuda-gdb, CUDA Visual Profiler, Parallel Nsight

27 CUDA software pipeline
A source file has a mix of host and device code
nvcc separates device code from host code and compiles device code into PTX/cubin
Host code is output as C code; nvcc can invoke the host compiler, or it can be compiled later
Applications can link to the generated host code
  Host code includes the PTX/cubin code as a global initialized data array, plus cudart (CUDA C runtime) function calls to load and launch kernels
Alternatively, one may load and execute the PTX/cubin using the CUDA driver API; the generated host code is then ignored

28 CUDA software architecture
Provides library functions for the host as well as the device
Implements a subset of stdlib (Source: NVIDIA)

29 System Requirements
CUDA GPU with CUDA device driver
CUDA software:
  CUDA Toolkit: tools to build a CUDA app, libraries, header files, and other resources
  CUDA SDK: sample projects (with configurations) including utility functions
C/C++ compiler (needs to be a compatible version)

30 Arrays of Parallel Threads
A CUDA kernel is executed many times: by a block of threads running concurrently, once per thread, each running the same kernel (SPMD)
Threads have access to their ID and may compute different memory addresses or take different control paths:

float x = input[tid];
float y = func(x);
output[tid] = y;

[Figure: the same three-line body replicated across threads 0..7, each with its own thread ID.]
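
A sketch of the slide's snippet wrapped in an actual kernel; func() is the slide's abstract placeholder, given a trivial body here:

// Each thread derives its own ID and therefore its own addresses.
__device__ float func(float x) { return x * x; }  // stand-in body

__global__ void apply(const float *input, float *output)
{
    int tid = threadIdx.x;   // per-thread ID within the block
    float x = input[tid];
    float y = func(x);
    output[tid] = y;
}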

31 Arrays of Parallel Blocks
Multiple blocks of threads may execute a kernel: a grid of blocks
Threads within a block communicate using shared memory, global memory, atomic operations, and barriers
Threads in different blocks share only global memory
[Figure: Thread Block 0, Thread Block 1, ..., Thread Block N, each running the same per-thread body indexed by thread ID.]

32 Main CUDA Construct
Run k instances of function f
Such parallel functions are called kernels, declared with the __global__ specifier

// Kernel definition
__global__ void f(float* F) {
    int id = threadIdx.x;
}

int main() {
    // Kernel invocation
    f<<<1, N>>>(F);
}

Object-oriented parts of C++ were not supported in device code in earlier CUDA versions

33 Kernel Invocation
<<<A, B>>> specifies a 2-level hierarchy: a grid of blocks; A blocks, each block of size B
All threads within a block are scheduled on the same SIMD SM
  Can share local memory (actually called shared memory in CUDA lingo; there is separate thread-local memory which, ironically, may not be physically close)
  Can synchronize with each other within the block: __syncthreads()
Different blocks are only loosely tied
  Must be able to execute independently (concurrently)
  Do share global memory

34 Thread Execution
A block does not execute in a SIMD fashion
Executed in groups of 32 parallel threads called warps (there need not be 32 physical cores)
All threads of a warp start together, but may diverge by branching
Branch paths are serialized until they converge back: an important efficiency consideration
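
A hedged illustration of the divergence point (kernel names and bodies are invented for the example): in divergent_kernel, the two halves of every warp take different paths and are serialized; in warp_aligned_kernel, the branch granularity is a whole warp, so no warp diverges:

__global__ void divergent_kernel(float *d) {
    int tid = threadIdx.x;
    if (tid % 2 == 0) d[tid] *= 2.0f;   // half of each warp
    else              d[tid] += 1.0f;   // the other half: serialized
}

__global__ void warp_aligned_kernel(float *d) {
    int tid = threadIdx.x;
    if ((tid / 32) % 2 == 0) d[tid] *= 2.0f;  // whole warps per path
    else                     d[tid] += 1.0f;  // no intra-warp divergence
}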

35 Grid/Block Dimension
Invocation: <<<A, B>>>; A and B need not be ints
A is a (up to) two-dimensional vector: dim3 A(m, n, 1); m, n are ints
B is a (up to) three-dimensional vector: dim3 B(a, b, c); a, b, c are ints
  a * b * c <= 512 on Tesla (1024 on Fermi, Kepler)
Up to 8 blocks may co-exist on an SM (16 on Kepler); resource sharing can further limit the counts; at least 1 must fit
c is the most significant dimension, a is the least
Dereference: B.x, B.y, and B.z
Thread ID = (x + y * a + z * a * b), where (x, y, z) = threadIdx
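
Putting the slide's formula into code, a minimal sketch (the grid/block sizes are arbitrary examples):

#include <cuda_runtime.h>

__global__ void Kernel(int *out)
{
    // The slide's flattened in-block thread ID: x + y*a + z*a*b,
    // with (a, b) = (blockDim.x, blockDim.y).
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    out[tid] = tid;   // illustration: every block writes the same values
}

int main(void)
{
    dim3 A(2, 2);        // 2 x 2 grid of blocks (z defaults to 1)
    dim3 B(8, 4, 2);     // 8*4*2 = 64 threads per block, within limits
    int *d;
    cudaMalloc((void**)&d, 64 * sizeof(int));
    Kernel<<<A, B>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}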

36 Thread ID
[Figure: a block of dimensions a x b x c; a thread at coordinates (x, y, z) has ID x + y * a + z * a * b.]

37 CUDA Memory Model Overview
Global memory
  Main host-device data communication path
  Visible to all threads
  Long latency
  L1 and L2 caches exist on Fermi/Kepler
Shared memory
  Fast memory; use as scratch
  Shared across a block
More memory segments: constant and texture (read-only, cached)
[Figure: host connected to device global memory; each block with its own shared memory and per-thread registers.] (courtesy Kirk & Hwu)

38 Memory Model Details
Shared memory is tied to a block; its lifetime ends with the block
Global, constant, and texture memories are persistent across kernels (within an application)
These are recognized as device memory, separate from host memory
The app must explicitly allocate/de-allocate device memory and manage data transfer between host and device memory

39 CUDA Device Memory Allocation
cudaMalloc()
  Allocates in global memory
  Parameters: address of a pointer to the allocated object; size of the allocated object
  No graphics display reset
cudaFree()
  Frees an object from global memory
Both are called on the host! Device pointers appear like host pointers
[Figure: host, grid, blocks, registers, shared memory, global memory.] (Thanks Kirk & Hwu)

40 CUDA Device Memory Allocation (cont.)
Code example:
  Allocate a 64 * 64 single-precision float array
  Attach the allocated storage to dM
  Prefix/suffix 'd' is often used for device data

int TILE_WIDTH = 64;
float *dM;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&dM, size);
...
cudaFree(dM);
(Courtesy Kirk & Hwu)

41 Example Memory Copy

size_t size = N * sizeof(float);
// Allocate vectors in host memory
float* hA = (float*)malloc(size);
float* hB = (float*)malloc(size);
// Declare device vector handles
float *dA, *dB;
// Allocate vectors in device global memory
cudaMalloc(&dA, size);
cudaMalloc(&dB, size);
// Copy host->device
cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
// Invoke kernel on GPU
ProcessDo<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, N);
// Copy result, hB, from device memory to host memory
cudaMemcpy(hB, dB, size, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(dA); cudaFree(dB);

See cudaMallocPitch() and cudaMalloc3D() for allocating 2D/3D arrays; they pad to meet alignment for efficient access (also see cudaMemcpy2D(), cudaMemcpy3D(), cudaMemcpyToSymbol()).

42 Example Host-Device Data Transfer

cudaMemcpy(dM, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, dM, size, cudaMemcpyDeviceToHost);

Transfers a 64 * 64 float array; M is in host memory and dM is in device memory
cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants
Also see cudaMemcpyAsync(). Recall the allocation earlier.
(Courtesy Kirk & Hwu)

43 More Ways to Initialize

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));

There is also page-locked (i.e., pinned) host memory: cudaHostAlloc() and cudaFreeHost()
  Copies between page-locked host memory and device memory can be performed concurrently with kernel execution
  Page-locked host memory can be directly mapped into the address space of the device
  Bandwidth between host memory and device memory is generally higher
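
A minimal sketch of the pinned-memory path the slide describes, assuming a single stream and an illustrative buffer size:

#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 1 << 20;    // 1 MB, arbitrary for the example
    float *h_pinned, *d;
    // Pinned host allocation: eligible for async, full-speed copies.
    cudaHostAlloc((void**)&h_pinned, size, cudaHostAllocDefault);
    cudaMalloc((void**)&d, size);
    // An async copy in a non-default stream can overlap kernel execution.
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(d, h_pinned, size, cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h_pinned);
    return 0;
}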

44 Memory Mapping
A block of page-locked host memory can be mapped into the address space of the device
  Use the flag cudaHostAllocMapped in cudaHostAlloc()
  cudaHostAlloc() returns a host memory pointer; the device memory pointer can be retrieved using cudaHostGetDevicePointer()
Accessing host-mapped memory from within a kernel has some advantages:
  Data transfers are implicitly performed as needed by the kernel
  Kernel-originated data transfers can overlap with kernel execution; no need to use streams to overlap data transfers with kernel execution
The application must synchronize memory accesses using streams or events to prevent RAW, WAR, or WAW hazards.

45 Memory Access Efficiency
All (active) threads of a warp load/store together
Concurrent accesses by a warp can be coalesced into fewer memory transactions
Shared memory is distributed into banks; non-conflicting accesses are efficient
Atomic operations are serialized
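
A sketch of what coalescing means in practice (kernel names invented for the example): in coalesced(), the 32 threads of a warp touch 32 consecutive words, which the hardware can merge into a few transactions; in strided(), each warp's accesses are spread 'stride' words apart and cannot be coalesced:

__global__ void coalesced(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d[i] += 1.0f;                  // consecutive words per warp
}

__global__ void strided(float *d, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    d[i] += 1.0f;                  // scattered: many transactions per warp
}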

46 CUDA Function Qualifiers

__device__ float dSomeName() {}
__global__ void kSomeName() {}

__global__ defines a kernel function
  Must return void
  Called from host, runs on device
  No recursion
__device__ functions are executed on and callable from the device
A __host__ qualifier also exists; __device__ and __host__ can be used together

47 CUDA Function Declarations (cont.)
__device__ functions are inlined for Tesla; can use __noinline__ and __forceinline__
Recursion is not supported in older versions
For functions executed on the device:
  No static variables inside the function
  No variable number of arguments
(Courtesy Kirk & Hwu)

48 Calling a Kernel Function - Thread Creation
Kernels are called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);        // 5000 thread blocks
dim3 DimBlock(4, 8, 8);       // 256 threads/block
size_t SharedMemBytes = 64;   // 64 bytes of shared mem, dynamically allocated (more later)
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Calls to a kernel function are asynchronous
Parameters are passed through shared/constant memory
Provision for explicit sync for blocking
(Courtesy Kirk & Hwu)

49 CUDA Variable Qualifiers
__device__
  Can be used in conjunction with the qualifiers below
  Resides in global memory; visible to all threads
  Accessible from the host through the runtime library
__constant__
  Read only on device
__shared__
  Has the lifetime of the block
  Accessible only from threads of that block
  extern allowed only with a single array; cannot be initialized at declaration
  Writes to non-volatile variables may be delayed; __syncthreads() required to ensure visibility across threads
Restrictions: file scope only; no extern; no struct/union in formal params; no struct/union on local vars; cannot be used on host-local variables; __constant__/__shared__ are static

50 External Shared Memory Variables
Instead of:

extern __shared__ short array0[128];
extern __shared__ float array1[64];
extern __shared__ int array2[256];

Do:

extern __shared__ char array[];
__device__ void func() // or __global__
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int*   array2 = (int*)&array1[64];
}

51 Memory Usage
Local variables go in registers or local memory
Pointers in device code must be resolved at compile time: whether pointing to shared or global memory [compute capability 1.x]
Following a global memory pointer in host code is not allowed
The address obtained by taking the address of a __device__, __shared__, or __constant__ variable can only be used in device code
The address of a __device__ or __constant__ variable can be queried using cudaGetSymbolAddress() on the host

52 Built-in Vector Types
char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4,
short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4,
int1, uint1, int2, uint2, int3, uint3, int4, uint4,
long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4,
float1, float2, float3, float4, double2, dim3

int2 point = make_int2(x, y); // x, y are ints

53 OpenACC

// Simple kernels region: vector add
void aplusb(int n, float *restrict a, float *b)
{
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

// Jacobi iteration
#pragma acc data copy(A, Anew)
while ( error > tol && iter < iter_max ) {
    error = 0.f;
    #pragma acc kernels
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
        }
    }
    #pragma acc kernels
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}

54 OpenCL
Execution model: kernels are executed by one or more work-items; work-items are collected into work-groups, and each work-group executes on a compute unit
Memory model: kernel data must be specifically placed in one of four address spaces: global memory, constant memory, local memory, or private memory; the location of the data determines how quickly it can be processed

55 Matrix Multiplication Example
Illustrates basic memory/thread management
  Assumes square matrices for simplicity
  No shared memory usage yet
  Local, register usage
  Thread ID usage
  Memory data transfer between host and device
(Courtesy Kirk & Hwu)

56 Square Matrix Multiplication
P = M * N, each of size WIDTH x WIDTH
Without tiling:
  One thread calculates one element of P
  M and N are loaded WIDTH times from global memory
[Figure: M, N, and P as WIDTH x WIDTH grids.] (Courtesy Kirk & Hwu)

57 MATMUL: A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
(Courtesy Kirk & Hwu)

58 MATMUL: Matrix Data Transfer

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    // 1. Allocate and load M, N to device memory
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc(&Pd, size);
    // 2. Kernel invocation code to be shown later
    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
(Courtesy Kirk & Hwu)

59 MATMUL: Kernel Function

// Matrix multiplication kernel - per-thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue stores the matrix element computed by the thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

60 MATMUL: Matrix Data Transfer

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *dM, *dN, *dP;
    // 1. Allocate and load M, N to device memory
    cudaMalloc(&dM, size);
    cudaMemcpy(dM, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&dN, size);
    cudaMemcpy(dN, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc(&dP, size);
    // 2. Kernel invocation code to be shown later
    // 3. Read P from the device
    cudaMemcpy(P, dP, size, cudaMemcpyDeviceToHost);
    // Free device matrices
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
}
(Courtesy Kirk & Hwu)

61 MATMUL: Matrix Data Transfer

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *dM, *dN, *dP;
    // 1. Allocate and load M, N to device memory
    cudaMalloc(&dM, size);
    cudaMemcpy(dM, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&dN, size);
    cudaMemcpy(dN, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc(&dP, size);
    // 2. Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(dM, dN, dP, Width);
    // 3. Read P from the device
    cudaMemcpy(P, dP, size, cudaMemcpyDeviceToHost);
    // Free device matrices
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
}

62 Single Thread Block
One thread block computes Pd; each thread computes one element of Pd
Each thread:
  Loads a row of matrix Md
  Loads a column of matrix Nd
  Performs one multiply and one addition for each pair of Md and Nd elements
Compute to off-chip memory access ratio close to 1:1 (not very high)
Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1, Block 1; thread (2, 2) computing one element of Pd from a row of Md and a column of Nd.] (Courtesy Kirk & Hwu)

63 MATMUL: Multiple Blocks
A 2D thread block computes a (TILE_WIDTH)^2 sub-matrix of Pd, with (TILE_WIDTH)^2 threads/block
Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks
Still need a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size!
[Figure: Md, Nd, Pd tiled by (bx, by) block and (tx, ty) thread coordinates.] (Courtesy Kirk & Hwu)

64 MATMUL: Multiple Blocks
A 2D thread block computes a (TILE_WIDTH)^2 sub-matrix of Pd, with (TILE_WIDTH)^2 threads/block
If the grid size required is greater than the maximum allowed, invoke multiple kernels
[Figure: the (bx, by) block and (tx, ty) thread coordinates selecting a row of Md and a column of Nd, with loop index k.]
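
A sketch completing these two slides: the same dot-product loop as slide 59, but indexed through blockIdx so a 2D grid of blocks covers the whole matrix. The kernel name is a variant invented here, and blockDim plays the role of TILE_WIDTH:

__global__ void MatrixMulKernelMB(float* Md, float* Nd, float* Pd, int Width)
{
    // (blockDim.x, blockDim.y) == (TILE_WIDTH, TILE_WIDTH)
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    Pd[row * Width + col] = Pvalue;
}

// Launch (assuming Width divisible by TILE_WIDTH):
//   dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
//   dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
//   MatrixMulKernelMB<<<dimGrid, dimBlock>>>(dM, dN, dP, Width);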

65 Synchronization (Block)
void __syncthreads()
  Block barrier AND ensures all global/shared memory accesses by all threads in the block are visible
int __syncthreads_count(int predicate)
  Returns the count of threads for which predicate != 0
int __syncthreads_and(int predicate)
  Returns non-zero iff predicate != 0 for all threads
int __syncthreads_or(int predicate)
  Returns non-zero iff predicate != 0 for any thread

66 Intra-Warp Synchronization
int __all(int predicate)
  Returns non-zero iff predicate != 0 for all threads of the warp
int __any(int predicate)
  Returns non-zero iff predicate != 0 for any thread of the warp
unsigned int __ballot(int predicate)
  Returns an int with the nth bit set iff predicate != 0 for the nth thread of the warp
Only supported by devices of sufficiently high compute capability

67 Atomic Operations
Read-modify-write on one 32-bit or 64-bit word in global or shared memory
Atomic ops on 64-bit words in shared memory available since compute capability 2.0
If the target is mapped page-locked memory, it's not atomic from the host's perspective
Only supported on signed and unsigned integers, except atomicExch() and atomicAdd(), which in 2.0 also work on float
More generic compare-and-swap: atomicCAS()
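
A classic use of atomicAdd(), sketched as a global-memory histogram (names invented for the example); concurrent increments to the same bin are serialized by the hardware, per the previous slide:

__global__ void histogram(const unsigned char *data, int n,
                          unsigned int *bins /* 256 zero-initialized bins */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // safe concurrent increment
}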

68 CAS Example

__device__ double atomicAdd(double* address, double val)
{
    double old = *address, assumed;
    do {
        assumed = old;
        old = __longlong_as_double(
                  atomicCAS((unsigned long long int*)address,
                            __double_as_longlong(assumed),
                            __double_as_longlong(val + assumed)));
    } while (assumed != old);
    return old;
}

69 Thrust Template Library

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main(void)
{
    thrust::host_vector<int> h_vec(1 << 24);
    // Initialize host array now...

    // Transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;
    // Do more...

    // Later transfer data back to the host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    return 0;
}
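
Thrust also supplies device algorithms; for instance, a sort like the one hand-coded later in this lecture can be had in one call. A minimal sketch:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Given device iterators, thrust::sort runs on the GPU.
void sort_on_gpu(thrust::device_vector<int>& d_vec)
{
    thrust::sort(d_vec.begin(), d_vec.end());
}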

70 Checking Return Values
Functions return cudaError_t:
  cudaSuccess
  cudaErrorInvalidValue
  cudaErrorInvalidSymbol
  cudaErrorInvalidDevicePointer
  cudaErrorInvalidMemcpyDirection
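
These error codes are commonly checked with a wrapper macro; a sketch (CUDA_CHECK is a local convention here, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap each runtime call so failures report file/line instead of
// being silently ignored.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err));                     \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&d, size));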

71 CUDA Sort Example

#define NUM 256

int main(int argc, char** argv)
{
    int values[NUM];
    for (int i = 0; i < NUM; i++)
        values[i] = getNextValue();
    int *dvalues;
    cudaMalloc((void**)&dvalues, sizeof(int) * NUM);
    cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);
    ParallelSort<<<1, NUM, sizeof(int) * NUM>>>(dvalues);
    cudaMemcpy(values, dvalues, sizeof(int) * NUM, cudaMemcpyDeviceToHost);
    cudaFree(dvalues);
}
(Contd...)

72 Sort Kernel - pg 1

__device__ inline void swap(int &a, int &b)
{
    int tmp = a;   // We expect these in registers
    a = b;
    b = tmp;
}

__global__ static void ParallelSort(int *values)
{
    extern __shared__ int shared[];
    const unsigned int tid = threadIdx.x;
    // Copy input to shared mem.
    shared[tid] = values[tid];
    __syncthreads();

73 Sort Kernel - pg 2

    // Parallel bitonic sort.
    for (unsigned int k = 2; k <= NUM; k <<= 1) {
        for (unsigned int j = k / 2; j > 0; j >>= 1) {
            unsigned int ixj = tid ^ j;
            if (ixj > tid) {
                if ((tid & k) == 0) {
                    if (shared[tid] > shared[ixj])
                        swap(shared[tid], shared[ixj]);
                } else {
                    if (shared[tid] < shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();
        }
    }
    values[tid] = shared[tid];   // Write result.
}
