1 GPU AND MANYCORE
2 CPUs+GPUs [Diagram: a two-socket server. Each socket holds several cores (each with private L1/L2), a shared L3, and a memory controller driving memory banks 1A..3D. The sockets are joined by 2 QPI links, 2x 19.2 GB/s full duplex. Each socket exposes 40x PCIe lanes; at ~1 GB/s per lane, an x16 slot gives 16 GB/s full duplex to a GPU. A PCH (PCI, SCSI/SATA, USB, audio, etc.) hangs off one socket over DMI2 at 20 Gb/s.]
3 Lookahead: Programming Model? #pragma omp accelerate [clauses...] Structured-block
4 Programming Model GPU is a multi-processor. One or more processes per GPU? How to start these processes? Persistent or fork-join? Distributed memory wrt CPU cores? What about memory across GPU cores? [Diagram: CPU cores sharing one memory; each group of GPU cores with its own memory.]
5 Programming Model? Execution model: fork off low-synchronization work-items to each accelerator. Memory model: explicit memory copy to each accelerator; support shared memory.
6
7 Volta
8 SM
9 Tesla K40 vs. Tesla M40 vs. Tesla P100 vs. Tesla V100:

                         Tesla K40      Tesla M40      Tesla P100     Tesla V100
    SMs                  15             24             56             80
    FP32 Cores/GPU       2880           3072           3584           5120
    FP64 Cores/GPU       960            96             1792           2560
    GPU Boost Clock      810/875 MHz    1114 MHz       1480 MHz       1462 MHz
    Peak FP32 TFLOPS     5.04           6.8            10.6           15
    Peak FP64 TFLOPS     1.68           0.21           5.3            7.5
    Texture Units        240            192            224            320
    Memory Interface     384-bit GDDR5  384-bit GDDR5  4096-bit HBM2  4096-bit HBM2
    Memory Size          Up to 12 GB    Up to 24 GB    16 GB          16 GB
    L2 Cache Size        1536 KB        3072 KB        4096 KB        6144 KB
    Shared Mem Size/SM   16/32/48 KB    96 KB          64 KB          Up to 96 KB
    Register File/GPU    3840 KB        6144 KB        14336 KB       20480 KB
    TDP                  235 W          250 W          300 W          300 W
    Transistors          7.1 billion    8 billion      15.3 billion   21.1 billion
10 Kepler (source: NVIDIA)
11 Kepler: SMX (source: NVIDIA)
12 Pascal P100
13 P100 SM
14 3D Memory 3D Chip-on-Wafer integration 5X bandwidth 2.5X capacity 4X energy efficiency
15 Parallel Memory Read PCI Express: 16 GB/s. Broadwell memory BW: 60 GB/s. P100 memory BW: 288 GB/s. NVLink: 20 GB/s (per link).
16 NVLink Differential signaling with embedded clock. PCIe programming model (with DMA+). Unified memory (cache coherency in Gen 2.0). 5 to 12X PCIe.
17 Inter-GPU NVLink (source: NVIDIA, 2016)
18 CPU vs GPU architecture Memory latency needs to be hidden: run many threads. [Diagram: the CPU spends its area on a large cache, ~8 MB; a GPU core has only ~64 KB.] The GPU can do this because of its high compute density. (source: NVIDIA)
19 CUDA Architecture (courtesy NVIDIA)
20 GPU Performance Massively parallel: a few thousand cores. Low power. Massively threaded: many thousand threads. Hardware-supported threads.
21 Programming Model? #pragma omp accelerate [clauses...] Structured-block
22 CUDA Programming Model Co-processor with many cores; CPU code schedules multi-threaded tasks. Both message passing and shared memory models. GPU threads are organized hierarchically: grids, blocks, warps. Shared memory; the memory hierarchy is opaque. Parallel read/write.
23 CUDA is C-like Integrated host+device app C program: serial or modestly parallel parts in host C code; highly parallel parts in device SPMD kernel C code. Execution alternates: Serial Code (host) ... Parallel Kernel (device): KernelA<<< nblk, ntid >>>(args); ... Serial Code (host) ... Parallel Kernel (device): KernelB<<< nblk, ntid >>>(args); (Courtesy Kirk & Hwu)
24 CUDA Devices and Threads A compute device is a coprocessor to the CPU (host), has its own DRAM (device memory), runs many threads in parallel, and is typically a GPU but can also be another type of parallel processing device. Data-parallel portions of an application are expressed as device kernels which run on many threads. Differences between GPU and CPU threads: GPU threads are extremely lightweight, with very little creation overhead; a GPU needs 1000s of threads for full efficiency, while a multi-core CPU needs only a few. (Courtesy Kirk & Hwu)
25 Extended C Declspecs: __global__, __device__, __shared__, __local__, __constant__. Built-in variables: threadIdx, blockIdx. Intrinsics: __syncthreads(). Runtime API: memory, symbol, and execution management; function launch.

    __device__ float filter[n];

    __global__ void convolve (float *image) {
        __shared__ float region[m];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory (simplified; the real cudaMalloc takes void** and a size)
    void *myimage = cudaMalloc(bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);

(Courtesy Kirk & Hwu)
26 Extended-C SW stack Integrated source (foo.cu) goes through the nvcc C/C++ frontend, which splits it into CPU host code (foo.cpp), compiled by gcc / cl, and GPU assembly (foo.s), compiled by OCG into SASS (foo.sass). Tooling around the stack: cuda-gdb, CUDA Visual Profiler, Parallel Nsight.
27 CUDA software pipeline Source files have a mix of host and device code. nvcc separates device code from host code and compiles device code into PTX/cubin. Host code is output as C code; nvcc can invoke the host compiler, or it can be compiled later. Applications can link to the generated host code; the host code includes the PTX/cubin code as a global initialized data array and cudart (CUDA C runtime) function calls to load and launch kernels. Alternatively, one may load and execute the PTX/cubin using the CUDA driver API; the host code is then ignored.
28 CUDA software architecture Provides library functions for the host as well as the device; implements a subset of stdlib. (source: NVIDIA)
29 System Requirements A CUDA GPU with a CUDA device driver. CUDA software: the CUDA Toolkit (tools to build a CUDA application, libraries, header files, and other resources) and the CUDA SDK (sample projects with configurations, including utility functions). A C/C++ compiler, which needs to be a compatible version.
30 Arrays of Parallel Threads A CUDA kernel is executed many times, by a block of threads running concurrently: once per thread, each running the same kernel (SPMD). Threads have access to their ID and so may compute different memory addresses or take different control paths. Every thread runs the same body, e.g.:

    float x = input[tid];
    float y = func(x);
    output[tid] = y;
31 Arrays of Parallel Blocks Multiple blocks of threads may execute a kernel: a grid of blocks. Threads within a block communicate using shared memory, global memory, atomic operations, and barriers. Threads in different blocks share only global memory. [Diagram: Thread Block 0 .. Thread Block N, each running the same per-thread code on its own range of thread IDs.] A sketch of the indexing follows.
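As a sketch of that indexing (with a hypothetical kernel name, scale), each thread combines its block and thread coordinates into a unique global index:

    __global__ void scale(const float *input, float *output, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
        if (tid < n)                                      // guard the tail block
            output[tid] = 2.0f * input[tid];
    }

    // Launch with enough blocks to cover n elements:
    //   int threads = 256;
    //   int blocks = (n + threads - 1) / threads;
    //   scale<<<blocks, threads>>>(dIn, dOut, n);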
32 Main CUDA Construct Run k instances of function f. Such parallel functions are called kernels, declared with the __global__ specifier:

    // Kernel definition
    __global__ void f(float *F) {
        int id = threadIdx.x;
    }

    int main() {
        // Kernel invocation
        f<<<1, N>>>(F);
    }

Object-oriented parts of C++ were not supported for device code in earlier CUDA.
33 Kernel Invocation <<<A, B>>> specifies a 2-level hierarchy, a grid of blocks: A blocks, each block of size B. All threads within a block are scheduled on the same SIMD SM. They can share local memory (actually called shared memory in CUDA lingo; there is separate thread-local memory which, ironically, may not be physically close) and can synchronize with each other within the block via __syncthreads(). Different blocks are only loosely tied: they must be able to execute independently (concurrently), but they do share global memory.
34 Thread Execution A block does not execute in a SIMD fashion; it is executed in groups of 32 parallel threads called warps (there need not be 32 physical cores). All threads of a warp start together but may diverge by branching; branch paths are serialized until they converge back. This is an important efficiency consideration, sketched below.
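A sketch of divergence (hypothetical kernel): even and odd lanes of a warp take different branches, so the two paths execute one after the other:

    __global__ void divergent(float *data) {
        int tid = threadIdx.x;
        if (tid % 2 == 0)                  // even lanes run this path first...
            data[tid] = data[tid] * 2.0f;
        else                               // ...then odd lanes; the warp reconverges after
            data[tid] = data[tid] + 1.0f;
    }

Branching on (tid / 32) instead would keep whole warps on one path and avoid the serialization.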
35 Grid/Block Dimension Invocation: <<<A, B>>>. A and B need not be ints. A is an (up to) two-dimensional vector: dim3 A(m, n, 1), with m, n ints. B is an (up to) three-dimensional vector: dim3 B(a, b, c), with a, b, c ints and a x b x c <= 512 on Tesla (1024 on Fermi, Kepler). Up to 8 blocks may co-exist on an SM (16 on Kepler); resource sharing can further limit the counts, but at least 1 block must fit. c is the most significant dimension, a the least. Dereference: B.x, B.y, and B.z. Thread ID = x + y * a + z * a * b, where (x, y, z) are a thread's coordinates within its block.
36 Thread ID [Diagram: a block of dimensions (a, b, c); the thread at coordinates (x, y, z) has ID x + y * a + z * a * b.] A code sketch follows.
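A sketch matching the formula, for a block of dimensions (a, b, c) held in blockDim (hypothetical kernel):

    __global__ void flatten(int *ids) {
        int id = threadIdx.x                              // x
               + threadIdx.y * blockDim.x                 // + y * a
               + threadIdx.z * blockDim.x * blockDim.y;   // + z * a * b
        ids[id] = id;
    }

    // Example launch: dim3 B(4, 2, 2); flatten<<<1, B>>>(dIds);  // 16 threads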
37 CUDA Memory Model Overview Global memory: the main host-device data communication path, visible to all threads, long latency (L1 and L2 caches exist on Fermi/Kepler). Shared memory: fast memory, used as scratch, shared across a block. More memory segments: constant and texture, read-only and cached. [Diagram: the host beside a grid of blocks; each block has its own shared memory and per-thread registers, and all blocks reach global memory.] (courtesy Kirk & Hwu)
38 Memory Model Details Shared memory is tied to a block; its lifetime ends with the block. Global, constant, and texture memories are persistent across kernels (within an application). These are recognized as device memory, separate from host memory: the app must explicitly allocate/de-allocate device memory and manage data transfer between host and device memory.
39 CUDA Device Memory Allocation cudaMalloc() allocates in global memory; its parameters are the address of a pointer to the allocated object and the size of the allocated object. No graphics display reset. cudaFree() frees an object from global memory. Both are called on the host! Device pointers appear like host pointers. [Diagram as on slide 37.] (Thanks Kirk & Hwu)
40 CUDA Device Memory Allocation Code example (cont.): allocate a 64 * 64 single-precision float array and attach the allocated storage to dM. The prefix/suffix d is often used for device data.

    int TILE_WIDTH = 64;
    float *dM, *M;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
    cudaMalloc((void**)&dM, size);
    ...
    cudaFree(dM);

(Courtesy Kirk & Hwu)
41 Example Memory Copy

    size_t size = N * sizeof(float);
    // Allocate vector in host memory
    float *hA = (float*)malloc(size);
    // Declare device vector handles
    float *dA, *dB, *dC;
    // Allocate vectors in device global memory
    cudaMalloc(&dA, size);
    cudaMalloc(&dB, size);
    // Copy host->device
    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
    // Invoke kernel on GPU
    ProcessDo<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, N);
    // Copy result, hB, from device memory to host memory
    cudaMemcpy(hB, dB, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(dA);
    cudaFree(dB);

See cudaMallocPitch() and cudaMalloc3D() for allocating 2D/3D arrays; they pad to meet alignment for efficient access (also see cudaMemcpy2D(), cudaMemcpy3D(), cudaMemcpyToSymbol()).
42 Example Host-Device Data Transfer

    cudaMemcpy(dM, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, dM, size, cudaMemcpyDeviceToHost);

Transfers a 64 * 64 float array (recall the allocation earlier): M is in host memory and dM is in device memory. cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants. Also see cudaMemcpyAsync(). (Courtesy Kirk & Hwu)
43 More Ways to Initialize

    __constant__ float constData[256];
    float data[256];
    cudaMemcpyToSymbol(constData, data, sizeof(data));

There is also page-locked (i.e., pinned) host memory: cudaHostAlloc() and cudaFreeHost(). Copies between page-locked host memory and device memory can be performed concurrently with kernel execution. Page-locked host memory can be directly mapped into the address space of the device. Bandwidth between host memory and device memory is generally higher. A sketch follows.
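A minimal sketch of the pinned path, assuming a device buffer dBuf and byte count supplied elsewhere (hypothetical function name):

    #include <cuda_runtime.h>

    void pinnedCopy(float *dBuf, size_t bytes) {
        float *hBuf;
        cudaHostAlloc((void**)&hBuf, bytes, cudaHostAllocDefault);  // page-locked host memory
        // ... fill hBuf ...
        cudaMemcpyAsync(dBuf, hBuf, bytes, cudaMemcpyHostToDevice, 0);  // may overlap kernels
        cudaStreamSynchronize(0);  // wait before touching hBuf again
        cudaFreeHost(hBuf);
    }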
44 Memory Mapping A block of page-locked host memory can be mapped into the address space of the device: use the flag cudaHostAllocMapped in cudaHostAlloc(). cudaHostAlloc() returns the host memory pointer; the device memory pointer can be retrieved using cudaHostGetDevicePointer(). Accessing host-mapped memory from within a kernel has some advantages: data transfers are implicitly performed as needed by the kernel, and such kernel-originated transfers overlap with kernel execution without any need for streams. The application must still synchronize memory accesses using streams or events to prevent RAW, WAR, or WAW hazards. A sketch follows.
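A sketch of zero-copy mapping (hypothetical kernel name, touch); mapping is enabled via device flags before the context is used:

    #include <cuda_runtime.h>

    __global__ void touch(float *p) { p[threadIdx.x] += 1.0f; }

    void zeroCopyDemo() {
        float *hPtr, *dPtr;
        cudaSetDeviceFlags(cudaDeviceMapHost);  // enable host mapping
        cudaHostAlloc((void**)&hPtr, 256 * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&dPtr, hPtr, 0);  // device alias of hPtr
        touch<<<1, 256>>>(dPtr);   // loads/stores travel over the bus on demand
        cudaDeviceSynchronize();   // sync before the host reads hPtr
        cudaFreeHost(hPtr);
    }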
45 Memory Access Efficiency All (active) threads of a warp load/store together; concurrent accesses by a warp can be coalesced into fewer memory transactions. Shared memory is distributed into banks; non-conflicting accesses are efficient. Atomic operations are serialized.
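A sketch of the contrast (hypothetical kernels): the first kernel's warp touches 32 consecutive words and coalesces into few transactions; the second does not:

    __global__ void coalesced(const float *in, float *out) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = in[tid];           // thread i reads word i: coalesced
    }

    __global__ void strided(const float *in, float *out, int stride) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = in[tid * stride];  // neighbors hit distant words: many transactions
    }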
46 CUDA Function Qualifiers

    __device__ float dSomeName() {}
    __global__ void kSomeName() {}

__global__ defines a kernel function: it must return void, is called from the host, and runs on the device. No recursion. __device__ functions are executed on, and called from, the device. A __host__ qualifier also exists; __device__ and __host__ can be used together.
47 CUDA Function Declarations (cont.) __device__ functions are inlined for Tesla; this can be controlled with __noinline__ and __forceinline__. Recursion is not supported in older versions. For functions executed on the device: no static variables inside the function, no variable number of arguments. (Courtesy Kirk & Hwu)
48 Calling a Kernel Function - Thread Creation Kernels are called with an execution configuration:

    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);        // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);       // 256 threads/block
    size_t SharedMemBytes = 64;   // 64 bytes of shared mem, dynamically allocated. More later.
    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Calls to a kernel function are asynchronous. Parameters are passed through shared/constant memory. There is provision for explicit sync for blocking. (Courtesy Kirk & Hwu)
49 CUDA Variable Qualifiers __device__: can be used in conjunction with the qualifiers below; resides in global memory; visible to all threads; accessible from the host through the runtime library. __constant__: read-only on the device. __shared__: has the lifetime of the block; accessible only from threads of that block; extern is allowed only with a single array, which cannot be initialized at declaration; writes to non-volatile variables may be delayed, so __syncthreads() is required to ensure visibility across threads. File scope only: no extern; no struct/union in formal params; no struct/union on local vars; cannot be used on host-local variables; __constant__/__shared__ are static. A declaration sketch follows.
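A sketch of the qualifiers in use (hypothetical names; the block is assumed to have 128 threads):

    __device__   float gScale = 1.5f;  // global memory, visible to all threads
    __constant__ float cTable[32];     // read-only on the device, cached

    __global__ void use(float *out) {
        __shared__ float sBuf[128];    // one copy per block, lives with the block
        sBuf[threadIdx.x] = cTable[threadIdx.x % 32] * gScale;
        __syncthreads();               // make the writes visible block-wide
        out[threadIdx.x] = sBuf[threadIdx.x];
    }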
50 External Shared Memory Variables Instead of:

    extern __shared__ short array0[128];
    extern __shared__ float array1[64];
    extern __shared__ int   array2[256];

do:

    extern __shared__ char array[];

    __device__ void func()   // or __global__
    {
        short *array0 = (short*)array;
        float *array1 = (float*)&array0[128];
        int   *array2 = (int*)&array1[64];
    }
51 Memory Usage Local variables live in registers or local memory. Pointers in device code must be resolved at compile time, i.e., whether they point to shared or global memory [1.x]. Following a global memory pointer in host code is not allowed. The address obtained by taking the address of a __device__, __shared__, or __constant__ variable can only be used in device code. The address of a __device__ or __constant__ variable can be queried from the host using cudaGetSymbolAddress(), as sketched below.
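A sketch of that host-side query (hypothetical variable and function names):

    __device__ float dVar;   // device-resident variable

    void setFromHost(float hVal) {
        void *addr;
        cudaGetSymbolAddress(&addr, dVar);  // host-usable address of the symbol
        cudaMemcpy(addr, &hVal, sizeof(float), cudaMemcpyHostToDevice);
    }

cudaMemcpyToSymbol(dVar, &hVal, sizeof(float)) would achieve the same in one call.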
52 Built-in Vector Types char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4, double2, and dim3.

    int2 point = make_int2(x, y);
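A usage sketch: the fields are .x, .y, .z, .w, and unspecified dim3 components default to 1:

    #include <vector_types.h>
    #include <vector_functions.h>

    int main() {
        float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
        float sum = v.x + v.y + v.z + v.w;
        dim3 grid(16, 16);               // grid.z == 1
        int2 point = make_int2(7, 9);
        return (int)sum + point.x + (int)grid.z;
    }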
53 OpenACC A simple kernels region:

    void aplusb(int n, float *restrict a, float *b) {
        #pragma acc kernels
        for (int i = 0; i < n; i++)
            a[i] = a[i] + b[i];
    }

Jacobi iteration:

    #pragma acc data copy(A, Anew)
    while ( error > tol && iter < iter_max ) {
        error = 0.f;
        #pragma acc kernels
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i] );
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }
        }
        #pragma acc kernels
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }
        iter++;
    }
54 OpenCL Execution Model: Kernels are executed by one or more work-items. Work-items are collected into work-groups and each work-group executes on a compute unit. Memory Model Kernel data must be specifically placed in one of four address spaces global memory, constant memory, local memory, or private memory. The location of the data determines how quickly it can be processed.
55 Matrix Multiplication Example Illustrates basic memory/thread management: local and register usage, thread ID usage, and memory data transfer between host and device. Assumes a square matrix for simplicity. No shared memory usage yet. (Courtesy Kirk & Hwu)
56 Square Matrix Multiplication P = M * N, each of size WIDTH x WIDTH. Without tiling: one thread calculates one element of P, and M and N are loaded WIDTH times from global memory. [Diagram: M, N, P as WIDTH x WIDTH matrices.] (Courtesy Kirk & Hwu)
57 MATMUL: A Simple Host Version in C

    // Matrix multiplication on the (CPU) host in double precision
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                double sum = 0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[i * Width + k];
                    double b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }

(Courtesy Kirk & Hwu)
58 MATMUL: Matrix Data Transfer

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;
        // 1. Allocate and load M, N to device memory
        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
        // Allocate P on the device
        cudaMalloc(&Pd, size);
        // 2. Kernel invocation code - to be shown later
        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        // Free device matrices
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }

(Courtesy Kirk & Hwu)
59 MATMUL: Kernel Function

    // Matrix multiplication kernel - per-thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue stores the matrix element computed by the thread
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }
60 MATMUL: Matrix Data Transfer

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *dM, *dN, *dP;
        // 1. Allocate and load M, N to device memory
        cudaMalloc(&dM, size);
        cudaMemcpy(dM, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&dN, size);
        cudaMemcpy(dN, N, size, cudaMemcpyHostToDevice);
        // Allocate P on the device
        cudaMalloc(&dP, size);
        // 2. Kernel invocation code - to be shown later
        // 3. Read P from the device
        cudaMemcpy(P, dP, size, cudaMemcpyDeviceToHost);
        // Free device matrices
        cudaFree(dM); cudaFree(dN); cudaFree(dP);
    }

(Courtesy Kirk & Hwu)
61 MATMUL: Matrix Data Transfer

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *dM, *dN, *dP;
        // 1. Allocate and load M, N to device memory
        cudaMalloc(&dM, size);
        cudaMemcpy(dM, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&dN, size);
        cudaMemcpy(dN, N, size, cudaMemcpyHostToDevice);
        // Allocate P on the device
        cudaMalloc(&dP, size);
        // 2. Set up the execution configuration
        dim3 dimGrid(1, 1);
        dim3 dimBlock(Width, Width);
        // Launch the device computation threads!
        MatrixMulKernel<<<dimGrid, dimBlock>>>(dM, dN, dP, Width);
        // 3. Read P from the device
        cudaMemcpy(P, dP, size, cudaMemcpyDeviceToHost);
        // Free device matrices
        cudaFree(dM); cudaFree(dN); cudaFree(dP);
    }
62 Single Thread Block One thread block computes Pd; each thread computes one element of Pd. Each thread loads a row of matrix Md and a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements. The compute to off-chip memory access ratio is close to 1:1 (not very high). The size of the matrix is limited by the number of threads allowed in a thread block. [Diagram: Grid 1, Block 1; Thread (2, 2) combines a row of Md with a column of Nd.] (Courtesy Kirk & Hwu)
63 MATMUL: Multiple Blocks A 2D thread block computes a (TILE_WIDTH)^2 sub-matrix of Pd, with (TILE_WIDTH)^2 threads per block. Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks. Still need a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size! [Diagram: Md, Nd, Pd tiled by block indices (bx, by) and thread indices (tx, ty).] (Courtesy Kirk & Hwu)
64 MATMUL: Multiple Blocks A 2D thread block computes a (TILE_WIDTH)^2 sub-matrix of Pd, with (TILE_WIDTH)^2 threads per block. Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks. If the grid size required is greater than the maximum allowed, invoke multiple kernels. [Diagram as on the previous slide.]
65 Synchronization (Block) void __syncthreads(): block-wide barrier AND ensures all global/shared memory accesses by all threads are visible within the block. int __syncthreads_count(int predicate): returns the count of threads for which predicate != 0. int __syncthreads_and(int predicate): returns non-zero iff predicate != 0 for all threads. int __syncthreads_or(int predicate): returns non-zero iff predicate != 0 for any thread.
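A sketch using the counting form (hypothetical kernel; assumes the grid exactly covers the data):

    __global__ void countHits(const float *data, int *result, float thresh) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int hits = __syncthreads_count(data[tid] > thresh);  // barrier + block-wide count
        if (threadIdx.x == 0)
            result[blockIdx.x] = hits;  // every thread received the same count
    }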
66 Intra-Warp Synchronization int __all(int predicate): returns non-zero iff predicate != 0 for all threads of the warp. int __any(int predicate): returns non-zero iff predicate != 0 for any thread of the warp. uint __ballot(int predicate): returns an int whose nth bit is set iff predicate != 0 for the nth thread of the warp. Only supported by devices of sufficient compute capability: 1.2 and above for __all/__any, 2.0 and above for __ballot.
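A sketch of the votes (hypothetical kernel); these are the pre-CUDA-9 forms named on this slide, later superseded by __all_sync / __any_sync / __ballot_sync with an explicit member mask:

    __global__ void vote(const int *flags, int *out) {
        int lane = threadIdx.x % 32;
        int pred = flags[threadIdx.x] != 0;
        if (__all(pred)) out[0] = 1;           // every lane satisfied it
        if (__any(pred)) out[1] = 1;           // at least one lane did
        unsigned mask = __ballot(pred);        // bit n set iff lane n's pred != 0
        if (lane == 0) out[2] = __popc(mask);  // number of lanes voting true
    }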
67 Atomic Operations Read-modify-write on one 32-bit or 64-bit word in global or shared memory. Atomic ops on 64-bit words in shared memory have been available since compute capability 2.0. If the target is mapped page-locked memory, the operation is not atomic from the host's perspective. Only supported on signed and unsigned integers, except atomicExch() and (in 2.0) atomicAdd(), which also work for float. More generic: compare-and-swap, atomicCAS().
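A sketch of the canonical atomic read-modify-write, a global-memory histogram (hypothetical kernel):

    __global__ void histogram(const unsigned char *in, unsigned int *bins, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            atomicAdd(&bins[in[tid]], 1u);  // safe even when threads collide on a bin
    }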
68 CAS Example

    __device__ double atomicAdd(double* address, double val)
    {
        double old = *address, assumed;
        do {
            assumed = old;
            old = __longlong_as_double(
                      atomicCAS((unsigned long long int*)address,
                                __double_as_longlong(assumed),
                                __double_as_longlong(val + assumed)));
        } while (assumed != old);
        return old;
    }
69 Thrust Template Library

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>

    int main(void)
    {
        thrust::host_vector<int> h_vec(1 << 24);
        // Initialize host array now ...

        // Transfer data to the device
        thrust::device_vector<int> d_vec = h_vec;
        // Do more ...

        // Later, transfer data back to the host
        thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
        return 0;
    }
70 Checking Return Values Functions return a cudaError_t: cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidSymbol, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection, and so on.
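As a sketch, a common checking idiom (the macro name is illustrative):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                        \
        do {                                                        \
            cudaError_t err = (call);                               \
            if (err != cudaSuccess) {                               \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                        cudaGetErrorString(err));                   \
                exit(EXIT_FAILURE);                                 \
            }                                                       \
        } while (0)

    // Usage: CUDA_CHECK(cudaMalloc(&dA, size));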
71 CUDA Sort Example

    #define NUM 256

    int main(int argc, char** argv)
    {
        int values[NUM];
        for (int i = 0; i < NUM; i++)
            values[i] = getNextValue();

        int *dvalues;
        cudaMalloc((void**)&dvalues, sizeof(int) * NUM);
        cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);

        ParallelSort<<<1, NUM, sizeof(int) * NUM>>>(dvalues);

        cudaMemcpy(values, dvalues, sizeof(int) * NUM, cudaMemcpyDeviceToHost);
        cudaFree(dvalues);
    }

Contd.
72 Sort Kernel (pg 1)

    __device__ inline void swap(int &a, int &b)
    {
        int tmp = a;   // We expect these in registers
        a = b;
        b = tmp;
    }

    __global__ static void ParallelSort(int *values)
    {
        extern __shared__ int shared[];
        const unsigned int tid = threadIdx.x;

        // Copy input to shared mem.
        shared[tid] = values[tid];
        __syncthreads();
73 Sort Kernel (pg 2)

        // Parallel bitonic sort.
        for (unsigned int k = 2; k <= NUM; k <<= 1) {
            for (unsigned int j = k / 2; j > 0; j >>= 1) {
                unsigned int ixj = tid ^ j;
                if (ixj > tid) {
                    if ((tid & k) == 0) {
                        if (shared[tid] > shared[ixj])
                            swap(shared[tid], shared[ixj]);
                    } else {
                        if (shared[tid] < shared[ixj])
                            swap(shared[tid], shared[ixj]);
                    }
                }
                __syncthreads();
            }
        }

        values[tid] = shared[tid];   // Write result.
    }