Accelerate with GPUs Harnessing GPGPUs with trending technologies


1 Accelerate with GPUs Harnessing GPGPUs with trending technologies Anubhav Jain and Amit Kalele Parallelization and Optimization CoE Tata Consultancy Services Ltd. Copyright 2016 Tata Consultancy Services Limited 1

2 Outline Overview of GPUs: HW architecture; parallelism; CPU/GPU model. Programming technologies: introduction to GPU programming, tools, SDK, etc. OpenACC: directive-based approach. Introduction to CUDA C: Nvidia-specific approach. OpenCL: generic approach. HIP: AMD-specific approach. Advanced programming example: CUDA approach. 2

3 Basic Terminology Host: the CPU, with CPU memory (host memory). Device: the GPU, with GPU memory (device memory). 3

4 CPU & GPU Architecture [Diagram: CPU die with control logic, cache, a few ALUs, and DRAM vs. GPU die with many ALUs and DRAM] CPU: low latency, lower throughput; fewer but more powerful ALUs. GPU: higher latency, higher throughput; many more but simpler ALUs. 4

5 CPU vs GPU Parallelism CPUs use task/data parallelism: multiple tasks map to multiple threads; tasks run different instructions; 10s of relatively heavyweight threads run on 10s of cores; each thread has to be individually programmed. GPUs use data parallelism: SIMT model (Single Instruction Multiple Thread); the same instruction runs on different data; 10,000s of lightweight threads on 100s of cores; programming is done for batches of threads (e.g., a block of threads). 5

6 CPU-GPU Model [Diagram: CPU with its memory connected to GPU with its memory over PCIe x16] CPU: host or master; offloads work to the GPU. GPU: device or co-processor; the worker. 6

7 GPGPU Programming Technologies AMD's HIP, OpenACC, OpenCL, CUDA. GPGPU terminology: GPGPU.org 7

8 GPU Streaming Multiprocessor Nvidia Tesla K40: 15 SMX units, 192 cores/SMX, 2880 total cores 8

9 GPU Memory Model (CUDA) Local: private to each thread. Shared: visible to all threads within a block; fast on-chip memory; like a user-managed cache. Global: visible to all threads within the grid. Texture/Constant: visible to all threads within the grid. 9
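To make these spaces concrete, here is a minimal kernel sketch (the kernel name and sizes are hypothetical, not from the slides):

    // Sketch (hypothetical kernel) showing where data lives in the CUDA memory model.
    __constant__ float coeff[16];            // constant memory: read-only, visible to the whole grid

    __global__ void memorySpaces(const float *in, float *out)   // in/out point to global memory
    {
        __shared__ float tile[256];          // shared memory: one copy per block, fast on-chip
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        float x = in[i];                     // x is local/private to this thread (a register)
        tile[threadIdx.x] = x * coeff[0];
        __syncthreads();                     // shared data is visible block-wide after the barrier
        out[i] = tile[threadIdx.x];
    }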

10-12 Simple Program Workflow [Diagram: host (CPU and CPU memory) connected over the PCI bus to the device (GPU cores and GPU memory); arrows mark each step] 1. Copying input data from host to device 2. Load device code and execute 3. Copying results back to host 12

13 OpenACC Open Accelerators: developed by Cray, CAPS, Nvidia, and PGI. PGI's OpenACC supports Nvidia Tesla, AMD Radeon, and Intel & AMD multicore processors. Supports C, C++, Fortran. PGI compilers: pgcc, pgc++, and pgfortran. Directive-based approach. 13

14 OpenACC: vectoradd [Diagram: parallel vector addition C = A + B over elements [0]..[101], split into blocks of 32]

    #pragma acc kernels loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for( int i = 0; i < N; ++i )
        c[i] = a[i] + b[i];

The clauses allocate a, b, c on the device, copy a and b from host to device, and copy c from device to host. Equivalently:

    #pragma acc parallel loop
    for( int i = 0; i < N; ++i )
        c[i] = a[i] + b[i];
14
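For reference, a complete minimal program around this directive might look like the following sketch (assuming the PGI toolchain; N and the initialization values are illustrative):

    #include <stdlib.h>

    #define N 1024

    int main(void)
    {
        float *a = (float *)malloc(N * sizeof(float));
        float *b = (float *)malloc(N * sizeof(float));
        float *c = (float *)malloc(N * sizeof(float));
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0f * i; }

        /* Offload the loop; copyin/copyout handle the host-device transfers */
        #pragma acc kernels loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
        for (int i = 0; i < N; ++i)
            c[i] = a[i] + b[i];

        free(a); free(b); free(c);
        return 0;
    }

This would be compiled with, e.g., pgcc -ta=tesla vectoradd.c (see slide 16 for the target options).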

15 OpenACC: vectoradd The OpenACC execution model has three levels: gang, worker, and vector. On Nvidia GPUs: gang == block, worker == warp, vector == threads.

    #pragma acc kernels loop gang(100), vector(128)
    for( int i = 0; i < N; ++i )
        c[i] = a[i] + b[i];

This spawns 100 CUDA blocks with 128 threads/block and helps in fine-tuning. 15

16 OpenACC: vectoradd Compile: pgcc -ta=<target accelerator> vectoradd.c Possible targets: -ta=tesla[:cuda7.0|cuda7.5|cuda8.0], -ta=radeon, -ta=host, -ta=multicore OpenACC learning reference: 16

17 CUDA Uses CUDA-enabled Nvidia GPUs for GPGPU processing. Developed by Nvidia. Supports C, C++, and PGI CUDA Fortran; third-party wrappers are also available for Java, Perl, Python, MATLAB, etc. CUDA compiler: nvcc splits the source file into host and device components; the CUDA compiler handles the device code, while a standard host compiler handles the host code. 17

18 CUDA Device code:

    /* Device kernel */
    __global__ void kernel( ... ) { ... }

Host code:

    int main() {
        /* Allocate space on device */
        cudaMalloc( ... );

        /* Copy data from host to device -- serial region (host) */
        cudaMemcpy( ..., cudaMemcpyHostToDevice);

        /* Launch the GPU kernel -- parallel region (device) */
        kernel<<<gridSize, blockSize>>>( ... );

        /* Copy data from device to host -- serial region (host) */
        cudaMemcpy( ..., cudaMemcpyDeviceToHost);

        /* Cleanup */
        cudaFree( ... );

        return 0;
    }
18

19 VectorAdd Example using CUDA

    #include <cuda_runtime.h>
    #include <stdlib.h>

    // Device kernel
    __global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < numElements) {
            C[i] = A[i] + B[i];
        }
    }

    int main(void)
    {
        int numElements = 50000;
        size_t size = numElements * sizeof(float);

        // Allocate the host vectors
        float *A = (float *)malloc(size);
        float *B = (float *)malloc(size);
        float *C = (float *)malloc(size);

        // Initialize the host input vectors
        for (int i = 0; i < numElements; ++i) {
            A[i] = rand() / (float)RAND_MAX;
            B[i] = rand() / (float)RAND_MAX;
        }

        // Allocate the device vectors
        float *d_A, *d_B, *d_C;
        cudaMalloc((void **)&d_A, size);
        cudaMalloc((void **)&d_B, size);
        cudaMalloc((void **)&d_C, size);

        // Copy input vectors from host to device
        cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

        // Launch the vector-add CUDA kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

        // Copy the device result vector back to the host
        cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

        // Free device global memory
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

        // Free host memory
        free(A); free(B); free(C);

        cudaDeviceReset();
        return 0;
    }
19
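In practice, production code also checks the result of each runtime call and kernel launch, which the slide omits for brevity. A minimal sketch (a fragment reusing the names above; fprintf requires <stdio.h>):

    // Sketch: error checking for CUDA runtime calls and kernel launches.
    cudaError_t err = cudaMalloc((void **)&d_A, size);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();            // catches launch-configuration errors
    if (err != cudaSuccess)
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));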

20 VectorAdd Example using CUDA Host-side data allocation and initialization (device calls: next slide):

    #include <cuda_runtime.h>

    int main(void)
    {
        int numElements = 50000;
        size_t size = numElements * sizeof(float);

        // Allocate the host vectors
        float *A = (float *)malloc(size);
        float *B = (float *)malloc(size);
        float *C = (float *)malloc(size);

        // Initialize the host input vectors
        for (int i = 0; i < numElements; ++i) {
            A[i] = rand() / (float)RAND_MAX;
            B[i] = rand() / (float)RAND_MAX;
        }

        // Device calls here (next slide)

        // Free host memory
        free(A); free(B); free(C);

        cudaDeviceReset();
        return 0;
    }
20

21 VectorAdd Example using CUDA Device-side allocation, transfers, and kernel launch (kernel: next slide):

    // Allocate the device vectors
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy input vectors from host to device
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    // Launch the vector-add CUDA kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    // Copy the device result vector back to the host
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device global memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
21

22 VectorAdd Example using CUDA

    // Device kernel
    __global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < numElements) {
            C[i] = A[i] + B[i];
        }
    }
22

23 Thread Hierarchy [Diagram: threads grouped into blocks, blocks grouped into a grid; Nvidia Tesla K20 (Kepler) architecture] A kernel is executed as a grid of blocks of threads. 23
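Grids and blocks can also be two- or three-dimensional. A minimal sketch of a 2D launch (the kernel and its sizes are hypothetical, not from the slides):

    // Sketch: a 2D grid of 2D blocks covering a width x height domain.
    __global__ void scale2D(float *data, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            data[y * width + x] *= 2.0f;   // each thread handles one element
    }

    // Host-side launch configuration:
    // dim3 block(16, 16);
    // dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    // scale2D<<<grid, block>>>(d_data, width, height);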

24 VectorAdd Example using CUDA [Diagram: blockIdx.x = 0, 1, 2, ... with threadIdx.x running 0..31 inside each block; i indexes elements [0]..[101] of A, B, and C in device memory]

    // Device kernel
    __global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < numElements) {
            C[i] = A[i] + B[i];
        }
    }
24

25 VectorAdd Example using CUDA Compile: nvcc vectoradd.cu -arch=sm_30 Compute capability of Nvidia cards: 1.0, 2.0, 3.0, 3.5, ... Check here: cudaGetDeviceProperties returns the major and minor numbers (the compute capability). CUDA SDK tools: nvvp: Nvidia Visual Profiler; nvprof: command-line profiler; cuda-gdb: Nvidia CUDA debugger. CUDA learning reference: 25
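For example, the compute capability can be queried at run time with the runtime API; a minimal sketch:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);   // fills in name, major, minor, etc.
            printf("Device %d: %s, compute capability %d.%d\n",
                   d, prop.name, prop.major, prop.minor);
        }
        return 0;
    }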

26 Nvidia Visual Profiler: nvvp [Screenshot: Analysis View, Timeline View, GPU Details View, PC Sampling View] 26

27 OpenCL Runs on CPUs, GPUs, DSPs, FPGAs, and other processors and hardware accelerators. Implementations are available from AMD, Apple, IBM, Altera, Intel, Nvidia, Qualcomm, Samsung, etc. Supports C, C++; third-party APIs for Python, Java, .NET, etc. 27

28 Program Workflow
CUDA: Create memory buffers on device -> Copy data host to device -> Execute kernel -> Copy results back to host -> Cleanup device memory
OpenCL: Read kernel (offline/online) -> Create OpenCL context -> Create command queue -> Create memory buffers on device -> Copy data host to device -> Create program from kernel source -> Build the program -> Create OpenCL kernel -> Set kernel arguments -> Execute kernel -> Copy results back to host -> Cleanup device memory
28

29 VectorAdd Example using OpenCL

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_SOURCE_SIZE (0x100000)

    int main(void)
    {
        const int LIST_SIZE = 1024 * 1024;
        int *A = (int *)malloc(sizeof(int) * LIST_SIZE);
        int *B = (int *)malloc(sizeof(int) * LIST_SIZE);
        for (int i = 0; i < LIST_SIZE; i++) {
            A[i] = i;
            B[i] = LIST_SIZE - i;
        }

        // Load the kernel source code into the array source_str
        FILE *fp = fopen("vector_add_kernel.cl", "r");
        char *source_str = (char *)malloc(MAX_SOURCE_SIZE);
        size_t source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
        fclose(fp);

        // Get platform and device information (steps hidden on the slide)
        cl_device_id device_id;
        cl_int ret;

        // Create an OpenCL context
        cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);

        // Create a command queue
        cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);

        // Create memory buffers on the device for each vector
        cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,  LIST_SIZE * sizeof(int), NULL, &ret);
        cl_mem b_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,  LIST_SIZE * sizeof(int), NULL, &ret);
        cl_mem c_mem_obj = clCreateBuffer(context, CL_MEM_WRITE_ONLY, LIST_SIZE * sizeof(int), NULL, &ret);

        // Copy the lists A and B to their respective memory buffers
        ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0, LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
        ret = clEnqueueWriteBuffer(command_queue, b_mem_obj, CL_TRUE, 0, LIST_SIZE * sizeof(int), B, 0, NULL, NULL);

        // Create a program from the kernel source
        cl_program program = clCreateProgramWithSource(context, 1,
            (const char **)&source_str, (const size_t *)&source_size, &ret);

        // Build the program
        ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);

        // Create the OpenCL kernel
        cl_kernel kernel = clCreateKernel(program, "vector_add", &ret);

        // Set the arguments of the kernel
        ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj);
        ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj);
        ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj);

        // Execute the OpenCL kernel on the list
        size_t global_item_size = LIST_SIZE;  // Process the entire lists
        size_t local_item_size = 64;          // Divide work items into groups of 64
        ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
            &global_item_size, &local_item_size, 0, NULL, NULL);

        // Read the memory buffer C on the device to the local variable C
        int *C = (int *)malloc(sizeof(int) * LIST_SIZE);
        ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0,
            LIST_SIZE * sizeof(int), C, 0, NULL, NULL);

        // Clean up
        ret = clFlush(command_queue);
        ret = clReleaseKernel(kernel);
        ret = clReleaseProgram(program);
        ret = clReleaseMemObject(a_mem_obj);
        ret = clReleaseMemObject(b_mem_obj);
        ret = clReleaseMemObject(c_mem_obj);
        ret = clReleaseCommandQueue(command_queue);
        ret = clReleaseContext(context);

        // Free host memory
        free(A); free(B); free(C);
        return 0;
    }
29

30 VectorAdd Example using OpenCL The host code is as on the previous slide; the device kernel (in vector_add_kernel.cl):

    // Device kernel
    __kernel void vector_add(__global const int *A,
                             __global const int *B,
                             __global int *C)
    {
        // get thread id
        int i = get_global_id(0);

        // Do the operation
        C[i] = A[i] + B[i];
    }
30

31 OpenCL Online/offline kernel compilation. Offline: a kernel binary is read in by the host code, via clCreateProgramWithBinary(). Online: the kernel source file is read in by the host code, via clCreateProgramWithSource(). 31
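A sketch of the offline path (a fragment reusing context and device_id from the earlier example; the binary file name is illustrative and error handling is omitted):

    // Sketch: loading a precompiled kernel binary (offline compilation).
    FILE *f = fopen("vector_add_kernel.bin", "rb");
    fseek(f, 0, SEEK_END);
    size_t bin_size = ftell(f);
    rewind(f);
    unsigned char *bin = (unsigned char *)malloc(bin_size);
    fread(bin, 1, bin_size, f);
    fclose(f);

    cl_int ret, bin_status;
    cl_program program = clCreateProgramWithBinary(context, 1, &device_id,
        &bin_size, (const unsigned char **)&bin, &bin_status, &ret);
    ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);  // finalize for the device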

32 OpenCL Tools Compile: gcc vectoradd.c -I /usr/local/cuda-7.0/include -L /usr/local/cuda-7.0/lib64 -lOpenCL Tools: CodeXL: powerful debugging, profiling & analysis; LTPV: Light Temporal Performance Viewer; AMD gDEBugger: debugger & memory analyzer. OpenCL learning reference:

33 Terminology Comparison
CUDA           OpenCL
thread         work-item
block          work-group
grid           NDRange
__global__     __kernel
threadIdx.x    get_local_id(0)
blockIdx.x     get_group_id(0)
blockDim.x     get_local_size(0)
<<< >>>        clEnqueueNDRangeKernel
33

34 Memory Model Comparison [Diagram: the OpenCL and CUDA memory hierarchies side by side] 34

35 Memory Model Comparison
CUDA               OpenCL
Global Memory      Global Memory
Texture Memory     Global Memory
Constant Memory    Constant Memory
Shared Memory      Local Memory
Local Memory       Private Memory
Registers          Private Memory
35

36 HIP Heterogeneous-compute Interface for Portability. Developed by AMD; allows developers to convert CUDA code to common C++. Supports C, C++. The resulting code can run through either the CUDA nvcc or AMD hcc compiler. Provides more choice in hardware and development tools. 36

37 HIP Code Conversion Workflow hipify: tool to convert CUDA code to portable C++; converts CUDA APIs and kernel builtins. For AMD platforms: AMD Kaveri, Carrizo, and Fiji. For Nvidia platforms: requires Unified Memory and CUDA SDK 6.0 or newer. 37

38 HIP Tools
hipify: tool to convert CUDA code to portable C++; converts CUDA APIs and kernel builtins.
hipcc: compiler driver that can replace nvcc in existing CUDA code; calls nvcc or hcc depending on the platform and includes the appropriate platform-specific headers and libraries.
hipconfig: prints the HIP configuration (HIP_PATH, HIP_PLATFORM, CXX config flags, etc.).
hipexamine.sh: script to scan a directory, find all code, and report statistics on how much can be ported with HIP (and identify likely features not yet supported).
hipconvertinplace.sh: hipifies all code files in the CUDA source directory.
38

39 VectorAdd Example using HIP Convert: hipify vectoradd.cu > vectoradd.cpp Compile: hipcc vectoradd.cpp

    #include <hip/hip_runtime.h>
    #include <stdlib.h>

    // Device kernel
    __global__ void vectorAdd(hipLaunchParm lp, const float *A, const float *B,
                              float *C, int numElements)
    {
        int i = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
        if (i < numElements) {
            C[i] = A[i] + B[i];
        }
    }

    int main(void)
    {
        int numElements = 50000;
        size_t size = numElements * sizeof(float);

        // Allocate and initialize the host vectors
        float *A = (float *)malloc(size);
        float *B = (float *)malloc(size);
        float *C = (float *)malloc(size);
        for (int i = 0; i < numElements; ++i) {
            A[i] = rand() / (float)RAND_MAX;
            B[i] = rand() / (float)RAND_MAX;
        }

        // Allocate the device vectors
        float *d_A, *d_B, *d_C;
        hipMalloc((void **)&d_A, size);
        hipMalloc((void **)&d_B, size);
        hipMalloc((void **)&d_C, size);

        // Copy input vectors from host to device
        hipMemcpy(d_A, A, size, hipMemcpyHostToDevice);
        hipMemcpy(d_B, B, size, hipMemcpyHostToDevice);

        // Launch the vector-add kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
        hipLaunchKernel(HIP_KERNEL_NAME(vectorAdd), dim3(blocksPerGrid),
                        dim3(threadsPerBlock), 0, 0, d_A, d_B, d_C, numElements);

        // Copy the device result vector back to the host
        hipMemcpy(C, d_C, size, hipMemcpyDeviceToHost);

        // Free device and host memory
        hipFree(d_A); hipFree(d_B); hipFree(d_C);
        free(A); free(B); free(C);

        hipDeviceReset();
        return 0;
    }
39

40 HIP Support
Supported: devices (hipSetDevice(), hipGetDeviceProperties(), etc.); memory management (hipMalloc(), hipMemcpy(), hipFree()); streams (hipStreamCreate(), etc., under development); events (hipEventRecord(), hipEventElapsedTime(), etc.); kernel launching; CUDA-style kernel indexing; device-side math built-ins; error reporting (hipGetLastError(), hipGetErrorString()).
Partially supported or not supported: textures, constant memory, dynamic parallelism, managed memory, CUDA libraries, graphics interoperation with OpenGL or Direct3D, CUDA arrays, mipmapped arrays and pitched memory, CUDA Driver API.
HIP is evolving; check for regular updates here: ProfessionalCompute-Tools/HIP
40

41 Methods of Kernel Launch
CUDA: vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
OpenCL: clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
HIP: hipLaunchKernel(HIP_KERNEL_NAME(vectorAdd), dim3(blocksPerGrid), dim3(threadsPerBlock), 0, 0, d_A, d_B, d_C, numElements);
HIP Porting Guide: 41

42 Terminology Comparison
CUDA           OpenCL                    HIP
thread         work-item                 thread
block          work-group                block
grid           NDRange                   grid
__global__     __kernel                  __global__
threadIdx.x    get_local_id(0)           hipThreadIdx_x
blockIdx.x     get_group_id(0)           hipBlockIdx_x
blockDim.x     get_local_size(0)         hipBlockDim_x
<<< >>>        clEnqueueNDRangeKernel    hipLaunchKernel
42

43 Programming with Shared Memory GPU shared memory: visible to all threads within a block; fast on-chip memory; like a user-managed cache. 43

44-46 Parallel Reduction [Diagram, built up over three slides: each thread loads one element into shared memory (sdata[threadIdx.x]) per block (blockIdx.x = 0, 1, 2, ...); pairs are then summed tree-fashion at strides s = 1, 2, 4, ... until sdata[0] holds the block's partial sum, which is written to g_odata[blockIdx.x]]

    __global__ void reduce0(int *g_idata, int *g_odata)
    {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // do reduction in shared mem
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2 * s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }
46
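Because the kernel declares extern __shared__ storage, the host must pass the per-block shared-memory size as the third launch parameter. A minimal host-side sketch (sizes and variable names are illustrative):

    // Sketch: launching reduce0 over N input elements.
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    size_t smemBytes = threads * sizeof(int);       // dynamic shared memory per block
    reduce0<<<blocks, threads, smemBytes>>>(d_idata, d_odata);
    // d_odata now holds one partial sum per block; run the kernel again on it
    // (or copy it back and finish on the CPU) to obtain the final total.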

47 GPU Accelerated Libraries
Application                                Libraries
Fast Fourier Transforms                    cuFFT, clFFT
Basic Linear Algebra                       cuBLAS, clBLAS, CLBlast
Sparse Matrix                              cuSPARSE, clSpMV
Dense and Sparse Direct Solvers            cuSOLVER, SuperLU
Parallel algorithms and data structures    Thrust, clpp
Random Number Generation                   cuRAND
Maths Library                              ArrayFire
47
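As an illustration of how much these libraries save, the block-wise reduction above collapses to a single call with Thrust; a minimal sketch (the data setup is illustrative; compiled with nvcc as a .cu file):

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>

    int main(void)
    {
        thrust::device_vector<int> d_vec(1024, 1);   // 1024 ones, already on the device
        int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
        return (sum == 1024) ? 0 : 1;                // sanity check
    }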

48 Success Stories [Bar chart: achieved speedup with GPUs, relative to each application's original performance, for: Incremental Risk Charge (IRC) (hardware footprint reduced by 50X), Value at Risk (VaR), Digital Watermarking, Principal Component Analysis (PCA), Mark-to-Market Computation for an IRS portfolio, and 2D-LBM for Fluid Flows] 48

49 Concluding remarks Many technologies exist for developing applications on GPUs. Some are easy to use (OpenACC) but may yield limited performance gains. CUDA/Nvidia has a good overall presence, but a proprietary technology may not be acceptable to some. OpenCL offers the flexibility of porting between hardware platforms, at the cost of somewhat more tedious programming than the others. HIP is a good tool for converting and running a CUDA application on an AMD device. In general, GPUs offer a good performance benefit for massively data-parallel applications. 49

50 Thank You 50
