Accelerate with GPUs: Harnessing GPGPUs with Trending Technologies
1 Accelerate with GPUs: Harnessing GPGPUs with Trending Technologies. Anubhav Jain and Amit Kalele, Parallelization and Optimization CoE, Tata Consultancy Services Ltd. Copyright 2016 Tata Consultancy Services Limited.
2 Outline. Overview of GPUs: HW architecture; parallelism; CPU/GPU model. Programming technologies: introduction to GPU programming, tools, SDKs, etc.; OpenACC (directive-based approach); introduction to CUDA C (Nvidia-specific approach); OpenCL (generic approach); HIP (AMD-specific approach). Advanced programming example (CUDA approach).
3 Basic Terminology. Host = CPU, with CPU memory (host memory). Device = GPU, with GPU memory (device memory).
4 CPU & GPU Architecture [block diagram: CPU with control logic, cache, a few ALUs, and DRAM; GPU with many ALUs and DRAM]. CPU: low latency, low throughput; fewer but more powerful ALUs. GPU: high latency, high throughput; many but simpler ALUs.
5 CPU vs GPU Parallelism. CPUs use task/data parallelism: multiple tasks map to multiple threads; tasks run different instructions; tens of relatively heavyweight threads run on tens of cores; each thread has to be individually programmed. GPUs use data parallelism: the SIMT model (Single Instruction, Multiple Threads); the same instruction runs on different data; tens of thousands of lightweight threads run on hundreds of cores; programming is done for batches of threads (e.g. a block of threads).
6 CPU-GPU Model [diagram: CPU and its memory connected to GPU and its memory over PCIe x16]. CPU: host or master, offloads work to the GPU. GPU: device or co-processor, the worker.
7 GPGPU Programming Technologies: AMD's HIP, OpenACC, OpenCL, CUDA. GPGPU terminology: GPGPU.org.
8 GPU Streaming Multiprocessor. Nvidia Tesla K40: 15 SMX units, 192 cores per SMX, 2880 cores in total.
9 GPU Memory Model (CUDA). Local: private to each thread. Shared: visible to all threads within a block; fast on-chip memory; like a user-managed cache. Global: visible to all threads within the grid. Texture/Constant: read-only, visible to all threads within the grid. A short sketch of these spaces in code follows below.
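To make these spaces concrete, here is a minimal sketch (not from the slides; the kernel and array names are illustrative) of how each space is declared in CUDA C:

__constant__ float coeff[16];            // constant memory: read-only, visible to all threads

__global__ void scale(const float *in, float *out, int n) {
    __shared__ float tile[256];          // shared memory: one copy per block, on-chip
                                         // (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in registers (thread-local)
    if (i < n) {
        tile[threadIdx.x] = in[i];       // read from global memory into shared memory
        __syncthreads();
        out[i] = tile[threadIdx.x] * coeff[0];       // write result back to global memory
    }
}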
10 Simple Program Workflow [diagram: host CPU and CPU memory connected over the PCI bus to the device, a grid of GPU cores with GPU memory]. Step 1: copy input data from host memory to device memory.
11 Simple Program Workflow (continued). Step 2: load the device code onto the GPU and execute it.
12 Simple Program Workflow (continued). Step 3: copy the results back from device memory to host memory.
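As a minimal, self-contained sketch of these three steps using the CUDA runtime API (introduced later in these slides; the kernel and sizes are illustrative):

#include <cuda_runtime.h>

__global__ void kernel(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];                                  // device code, runs on the GPU
}

int main(void) {
    const int N = 1024;
    size_t size = N * sizeof(float);
    float h_in[N], h_out[N];                                // host buffers (initialization omitted)
    float *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);

    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);   // 1. copy input host -> device
    kernel<<<N / 256, 256>>>(d_in, d_out);                  // 2. load device code and execute
    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost); // 3. copy results device -> host

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}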
13 OpenACC (Open Accelerators): developed by Cray, CAPS, Nvidia and PGI. PGI's OpenACC supports Nvidia Tesla, AMD Radeon, and Intel & AMD multicore processors. Supports C, C++ and Fortran. PGI compilers: pgcc, pgc++ and pgfortran. Directive-based approach.
14 OpenACC: vectorAdd [diagram: parallel vector addition C = A + B across elements [0]..[101] of A, B and C].

#pragma acc kernels loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
for (int i = 0; i < N; ++i)
    c[i] = a[i] + b[i];

The data clauses allocate a, b, c on the device, copy a and b from host to device, and copy c from device to host. An alternative form, leaving data movement to the compiler:

#pragma acc parallel loop
for (int i = 0; i < N; ++i)
    c[i] = a[i] + b[i];
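A complete, compilable version of this example, as a minimal sketch (the array size and initialization are illustrative):

#include <stdlib.h>

#define N 102

int main(void) {
    float *a = (float *)malloc(N * sizeof(float));
    float *b = (float *)malloc(N * sizeof(float));
    float *c = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0f * i; }

    // The compiler generates device code for the loop and moves a, b, c as directed
    #pragma acc kernels loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    free(a); free(b); free(c);
    return 0;
}

Built with, e.g., pgcc -acc vectoradd.c; target selection flags appear on slide 16.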
15 OpenACC: vectorAdd. The OpenACC execution model has three levels: gang, worker, and vector. On Nvidia GPUs, roughly gang == block, worker == warp, and vector == threads.

#pragma acc kernels loop gang(100), vector(128)
for (int i = 0; i < N; ++i)
    c[i] = a[i] + b[i];

This spawns 100 CUDA blocks with 128 threads per block, and helps in fine-tuning.
16 OpenACC: vectorAdd. Compile with pgcc, selecting the target accelerator with the -ta flag:

pgcc -ta=tesla vectoradd.c

Targets include -ta=tesla[:cuda7.0|cuda7.5|cuda8.0], -ta=radeon, -ta=host and -ta=multicore. OpenACC learning reference:
17 CUDA: uses CUDA-enabled Nvidia GPUs for GPGPU processing. Developed by Nvidia. Supports C, C++ and PGI CUDA Fortran; third-party wrappers are also available for Java, Perl, Python, MATLAB, etc. CUDA compiler: nvcc. It splits each source file into host and device components; the CUDA compiler handles the device code, while a standard host compiler handles the host code.
18 CUDA

Device code:

/* Device kernel */
__global__ void kernel(...) { ... }

Host code:

int main() {
    /* Allocate space on device */
    cudaMalloc(...);
    /* Copy data from host to device */
    cudaMemcpy(..., cudaMemcpyHostToDevice);
    /* Launch GPU kernel */
    kernel<<<gridSize, blockSize>>>(...);
    /* Copy data from device to host */
    cudaMemcpy(..., cudaMemcpyDeviceToHost);
    /* Cleanup */
    cudaFree(...);
    return 0;
}

Execution alternates between a serial region (host), a parallel region (device), and a serial region (host).
19 VectorAdd Example using CUDA (full listing; broken down on the next slides)

#include <stdlib.h>
#include <cuda_runtime.h>

// Device kernel
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

int main(void) {
    int numElements = 50000;
    size_t size = numElements * sizeof(float);

    // Allocate the host vectors
    float *A = (float *)malloc(size);
    float *B = (float *)malloc(size);
    float *C = (float *)malloc(size);

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i) {
        A[i] = rand() / (float)RAND_MAX;
        B[i] = rand() / (float)RAND_MAX;
    }

    // Allocate the device vectors
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy input vectors from host to device
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    // Launch the vector add CUDA kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    // Copy the device result vector to host
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device global memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

    // Free host memory
    free(A); free(B); free(C);

    cudaDeviceReset();
    return 0;
}
20 VectorAdd Example using CUDA: host-side data allocation and initialization (device calls on the next slide)

#include <cuda_runtime.h>

int main(void) {
    int numElements = 50000;
    size_t size = numElements * sizeof(float);

    // Allocate the host vectors
    float *A = (float *)malloc(size);
    float *B = (float *)malloc(size);
    float *C = (float *)malloc(size);

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i) {
        A[i] = rand() / (float)RAND_MAX;
        B[i] = rand() / (float)RAND_MAX;
    }

    // Device calls here: NEXT SLIDE

    // Free host memory
    free(A); free(B); free(C);
    cudaDeviceReset();
    return 0;
}
21 VectorAdd Example using CUDA: device-side allocation, transfers and kernel launch

// Allocate the device vectors
float *d_A, *d_B, *d_C;
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

// Copy input vectors from host to device
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

// Launch the vector add CUDA kernel (kernel on the next slide)
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

// Copy the device result vector from device back to host
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

// Free device global memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
22 VectorAdd Example using CUDA: the device kernel

// Device kernel
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}
23 Thread Hierarchy [diagram: thread -> block -> grid]. A kernel is executed as a grid of blocks of threads; pictured on the Nvidia Tesla K20 (Kepler) architecture. A launch-configuration sketch follows below.
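A small sketch (not from the slides) of how this hierarchy is specified at launch time, reusing the names from the vectorAdd example; dim3 generalizes the same launch to 2D and 3D grids and blocks:

// 1D launch configuration for vectorAdd
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
dim3 block(threadsPerBlock);          // threads per block (x dimension only)
dim3 grid(blocksPerGrid);             // blocks per grid (x dimension only)
vectorAdd<<<grid, block>>>(d_A, d_B, d_C, numElements);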
24 VectorAdd Example using CUDA: computing the global index [diagram: threadIdx.x runs 0..31 within each block; blockIdx.x = 0, 1, 2, ...; arrays A, B and C with elements [0]..[101] in device memory].

int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements)
    C[i] = A[i] + B[i];

Each thread computes one element: blockDim.x * blockIdx.x offsets into the array by whole blocks, and threadIdx.x selects the element within the block.
25 VectorAdd Example using CUDA. Compile: nvcc vectoradd.cu -arch=sm_30. Compute capabilities of Nvidia cards: 1.0, 2.0, 3.0, 3.5, etc.; cudaGetDeviceProperties() returns the major and minor numbers (the compute capability). CUDA SDK tools: nvvp (Nvidia Visual Profiler), nvprof (command-line profiler), cuda-gdb (Nvidia CUDA debugger). CUDA learning reference:
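A small sketch (not from the slides) of querying the compute capability with cudaGetDeviceProperties():

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}

The major.minor pair printed here is what the -arch=sm_XY flag should match (e.g. 3.0 -> sm_30).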
26 Nvidia Visual Profiler (nvvp) [screenshot]: provides Analysis, Timeline, GPU Details and PC Sampling views.
27 OpenCL: targets CPUs, GPUs, DSPs, FPGAs and other processors and hardware accelerators. Implementations are available from AMD, Apple, IBM, Altera, Intel, Nvidia, Qualcomm, Samsung, etc. Supports C and C++; third-party APIs for Python, Java, .NET, etc.
28 Program Workflow: CUDA vs OpenCL.
CUDA: create memory buffers on the device; copy data from host to device; execute the kernel; copy results back to the host; clean up device memory.
OpenCL: read the kernel (offline/online); create an OpenCL context; create a command queue; create memory buffers on the device; copy data from host to device; create a program from the kernel source; build the program; create the OpenCL kernel; set the kernel arguments; execute the kernel; copy results back to the host; clean up device memory.
29 VectorAdd Example using OpenCL (full listing; the kernel is shown on the next slide)

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define MAX_SOURCE_SIZE (0x100000)

int main(void) {
    const int LIST_SIZE = 1024 * 1024;
    int *A = (int *)malloc(sizeof(int) * LIST_SIZE);
    int *B = (int *)malloc(sizeof(int) * LIST_SIZE);
    for (int i = 0; i < LIST_SIZE; i++) {
        A[i] = i;
        B[i] = LIST_SIZE - i;
    }

    // Load the kernel source code into the array source_str
    FILE *fp = fopen("vector_add_kernel.cl", "r");
    char *source_str = (char *)malloc(MAX_SOURCE_SIZE);
    size_t source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);

    // Get platform and device information (steps hidden on the slide)
    cl_device_id device_id;
    cl_int ret;
    ...

    // Create an OpenCL context
    cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);

    // Create a command queue
    cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);

    // Create memory buffers on the device for each vector
    cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                      LIST_SIZE * sizeof(int), NULL, &ret);
    // (b_mem_obj and c_mem_obj are created likewise)

    // Copy the lists A and B to their respective memory buffers
    ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
                               LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
    // (B is copied likewise)

    // Create a program from the kernel source
    cl_program program = clCreateProgramWithSource(context, 1,
            (const char **)&source_str, (const size_t *)&source_size, &ret);

    // Build the program
    ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);

    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "vector_add", &ret);

    // Set the arguments of the kernel
    ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj);

    // Execute the OpenCL kernel on the list
    size_t global_item_size = LIST_SIZE;  // process the entire lists
    size_t local_item_size = 64;          // divide work-items into groups of 64
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                 &global_item_size, &local_item_size, 0, NULL, NULL);

    // Read the memory buffer C on the device to the local variable C
    int *C = (int *)malloc(sizeof(int) * LIST_SIZE);
    ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0,
                              LIST_SIZE * sizeof(int), C, 0, NULL, NULL);

    // Clean up
    ret = clFlush(command_queue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(a_mem_obj);
    ret = clReleaseCommandQueue(command_queue);
    ret = clReleaseContext(context);

    // Free host memory
    free(A); free(B); free(C);
    return 0;
}
30 VectorAdd Example using OpenCL: the device kernel (vector_add_kernel.cl; host code as on the previous slide)

// Device kernel
__kernel void vector_add(__global const int *A,
                         __global const int *B,
                         __global int *C) {
    // Get the work-item (thread) id
    int i = get_global_id(0);

    // Do the operation
    C[i] = A[i] + B[i];
}
31 OpenCL Online/Offline kernel compilation. Offline: a precompiled kernel binary is read in by the host code and loaded with clCreateProgramWithBinary(). Online: the kernel source file is read in by the host code and compiled at run time with clCreateProgramWithSource().
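A hedged sketch of the offline path, assuming context and device_id were obtained as in the example above and that vector_add_kernel.bin (an illustrative file name) holds a binary previously compiled for that device; clBuildProgram() is still called to finalize the program:

FILE *fp = fopen("vector_add_kernel.bin", "rb");       // precompiled kernel binary
fseek(fp, 0, SEEK_END);
size_t binary_size = (size_t)ftell(fp);
rewind(fp);
unsigned char *binary = (unsigned char *)malloc(binary_size);
fread(binary, 1, binary_size, fp);
fclose(fp);

cl_int binary_status, ret;
cl_program program = clCreateProgramWithBinary(context, 1, &device_id,
        &binary_size, (const unsigned char **)&binary, &binary_status, &ret);
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);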
32 OpenCL Tools. Compile: gcc vectoradd.c -I /usr/local/cuda-7.0/include -L /usr/local/cuda-7.0/lib64 -lOpenCL. Tools: CodeXL (powerful debugging, profiling & analysis), LTPV (Light Temporal Performance Viewer), AMD gDEBugger (debugger & memory analyzer). OpenCL learning reference:
33 Terminology Comparison

CUDA           OpenCL              |  CUDA          OpenCL
thread         work-item           |  threadIdx.x   get_local_id(0)
block          work-group          |  blockIdx.x    get_group_id(0)
grid           NDRange             |  blockDim.x    get_local_size(0)
__global__     __kernel            |  <<< >>>       clEnqueueNDRangeKernel
34 Memory Model Comparison [side-by-side diagrams of the OpenCL and CUDA memory models; summarized on the next slide].
35 Memory Model Comparison.
CUDA memory spaces: global, local, texture, constant and shared memory, plus registers.
OpenCL memory spaces: global, constant, local and private memory.
Note that the names clash: CUDA shared memory corresponds to OpenCL local memory, while CUDA local memory and registers correspond to OpenCL private memory.
36 HIP (Heterogeneous-compute Interface for Portability): developed by AMD; allows developers to convert CUDA code to common C++. Supports C and C++. The resulting code can be compiled with either Nvidia's NVCC or AMD's HCC compiler, providing more choice in hardware and development tools.
37 HIP Code Conversion Workflow. hipify: a tool to convert CUDA code to portable C++; converts CUDA APIs and kernel builtins. For AMD platforms: AMD Kaveri, Carrizo and Fiji. For Nvidia platforms: requires Unified Memory and CUDA SDK 6.0 or newer.
38 HIP Tools.
hipify: tool to convert CUDA code to portable C++; converts CUDA APIs and kernel builtins.
hipcc: compiler driver that can replace nvcc in existing CUDA code; calls nvcc or hcc depending on the platform, and includes the appropriate platform-specific headers and libraries.
hipconfig: prints the HIP configuration (HIP_PATH, HIP_PLATFORM, CXX config flags, etc.).
hipexamine.sh: script to scan a directory, find all code, and report statistics on how much can be ported with HIP (and identify likely features not yet supported).
hipconvertinplace.sh: hipifies all code files in the CUDA source directory in place.
39 VectorAdd Example using HIP

Convert: hipify vectoradd.cu > vectoradd.cpp
Compile: hipcc vectoradd.cpp

#include <hip/hip_runtime.h>

// Device kernel
__global__ void vectorAdd(hipLaunchParm lp, const float *A, const float *B,
                          float *C, int numElements) {
    int i = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

int main(void) {
    int numElements = 50000;
    size_t size = numElements * sizeof(float);

    // Allocate and initialize the host vectors
    float *A = (float *)malloc(size);
    float *B = (float *)malloc(size);
    float *C = (float *)malloc(size);
    for (int i = 0; i < numElements; ++i) {
        A[i] = rand() / (float)RAND_MAX;
        B[i] = rand() / (float)RAND_MAX;
    }

    // Allocate the device vectors
    float *d_A, *d_B, *d_C;
    hipMalloc((void **)&d_A, size);
    hipMalloc((void **)&d_B, size);
    hipMalloc((void **)&d_C, size);

    // Copy input vectors from host to device
    hipMemcpy(d_A, A, size, hipMemcpyHostToDevice);
    hipMemcpy(d_B, B, size, hipMemcpyHostToDevice);

    // Launch the vector add kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    hipLaunchKernel(HIP_KERNEL_NAME(vectorAdd), dim3(blocksPerGrid),
                    dim3(threadsPerBlock), 0, 0, d_A, d_B, d_C, numElements);

    // Copy the device result vector to host
    hipMemcpy(C, d_C, size, hipMemcpyDeviceToHost);

    // Free device and host memory
    hipFree(d_A); hipFree(d_B); hipFree(d_C);
    free(A); free(B); free(C);

    hipDeviceReset();
    return 0;
}
40 HIP Support.
Supported: devices (hipSetDevice(), hipGetDeviceProperties(), etc.); memory management (hipMalloc(), hipMemcpy(), hipFree()); streams (hipStreamCreate(), etc., under development); events (hipEventRecord(), hipEventElapsedTime(), etc.); kernel launching; CUDA-style kernel indexing; device-side math built-ins; error reporting (hipGetLastError(), hipGetErrorString()).
Partially supported or not supported (HIP is evolving; check for regular updates at ProfessionalCompute-Tools/HIP): textures; constant memory; dynamic parallelism; managed memory; CUDA libraries; graphics interoperation with OpenGL or Direct3D; CUDA arrays, mipmapped arrays and pitched memory; the CUDA driver API.
41 Methods of Kernel Launch.

CUDA:
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

OpenCL:
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

HIP:
hipLaunchKernel(HIP_KERNEL_NAME(vectorAdd), dim3(blocksPerGrid), dim3(threadsPerBlock), 0, 0, d_A, d_B, d_C, numElements);

HIP Porting Guide:
42 Terminology Comparison

CUDA          OpenCL                   HIP
thread        work-item                thread
block         work-group               block
grid          NDRange                  grid
__global__    __kernel                 __global__
threadIdx.x   get_local_id(0)          hipThreadIdx_x
blockIdx.x    get_group_id(0)          hipBlockIdx_x
blockDim.x    get_local_size(0)        hipBlockDim_x
<<< >>>       clEnqueueNDRangeKernel   hipLaunchKernel
43 Programming with Shared Memory. GPU shared memory is visible to all threads within a block; it is fast on-chip memory, like a user-managed cache.
44 Parallel Reduction [diagram: each thread loads one element of g_idata into sdata, indexed by threadIdx.x within blocks blockIdx.x = 0, 1, 2, ...]

__global__ void reduce0(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    // Each thread loads one element from global to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // Do the reduction in shared memory
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Write the result for this block to global memory
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
45 Parallel Reduction (continued; code as on the previous slide) [diagram: successive passes over sdata; at each step s, pairs of elements 2*s apart are summed in shared memory until sdata[0] holds the block's total].
46 Parallel Reduction (continued; code as on slide 44) [diagram: thread 0 of each block writes its block's partial sum sdata[0] to g_odata[blockIdx.x]]. A host-side launch sketch follows below.
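A hedged host-side sketch (not from the slides) of launching reduce0; d_idata and d_odata are device arrays as in the kernel above, while h_odata and n are illustrative names. Note the third launch parameter, which sizes the extern __shared__ array, and the final pass combining the per-block partial sums:

int n = 1 << 20;                                   // number of input elements in d_idata
int threads = 256;
int blocks = (n + threads - 1) / threads;
int *d_odata, *h_odata = (int *)malloc(blocks * sizeof(int));
cudaMalloc(&d_odata, blocks * sizeof(int));

// The third launch parameter allocates threads * sizeof(int) bytes of
// dynamic shared memory per block for the extern __shared__ sdata[] array
reduce0<<<blocks, threads, threads * sizeof(int)>>>(d_idata, d_odata);

// Combine the per-block partial sums on the host
// (a second reduce0 pass over d_odata would also work)
cudaMemcpy(h_odata, d_odata, blocks * sizeof(int), cudaMemcpyDeviceToHost);
int total = 0;
for (int b = 0; b < blocks; ++b) total += h_odata[b];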
47 GPU Accelerated Libraries

Application                                CUDA       OpenCL
Fast Fourier transforms                    cuFFT      clFFT
Basic linear algebra                       cuBLAS     clBLAS, CLBlast
Sparse matrix                              cuSPARSE   clSpMV
Dense and sparse direct solvers            cuSOLVER   SuperLU
Parallel algorithms and data structures    Thrust     clpp
Random number generation                   cuRAND
Maths library                              ArrayFire
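As an illustration (not from the slides) of what such libraries buy you, the whole vectorAdd example collapses to a few lines with Thrust; a minimal sketch, compiled with nvcc:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main(void) {
    int n = 50000;
    thrust::device_vector<float> A(n, 1.0f);   // allocated in GPU global memory
    thrust::device_vector<float> B(n, 2.0f);
    thrust::device_vector<float> C(n);

    // C[i] = A[i] + B[i], executed on the device; Thrust picks the launch configuration
    thrust::transform(A.begin(), A.end(), B.begin(), C.begin(), thrust::plus<float>());
    return 0;
}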
48 Success Stories [bar chart: speedup achieved with GPUs by various applications, relative to their original performance]. Applications include: Incremental Risk Charge (IRC), with a 50X hardware footprint reduction; Value at Risk (VaR); digital watermarking; Principal Component Analysis (PCA); mark-to-market computation for an IRS portfolio; and 2D LBM for fluid flows.
49 Concluding Remarks. There are many technologies for developing applications on GPUs. Some are easy to use (OpenACC) but may yield limited performance gains. CUDA/Nvidia has a good overall presence, but a proprietary technology may not be acceptable to some. OpenCL offers the flexibility of porting between hardware platforms, at the cost of a somewhat more tedious programming style than the others. HIP is a good tool for converting a CUDA application and running it on an AMD device. In general, GPUs offer a good performance benefit for massively data-parallel applications.
50 Thank You