GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum

GPU Computing Performance
In the last few years the GPU has evolved into an absolute computing workhorse:
- programmable processor
- very high memory bandwidth
- multiple cores (high parallelism)

CPU vs. GPU
- The GPU was originally specialized for math-intensive, highly parallel computation (exactly what graphics rendering is about).
- On the GPU (in contrast to the CPU), more transistors are devoted to data processing (ALUs) than to data caching and flow control.

Problem: GPGPU (general-purpose computation on GPUs)
GPGPU so far: program the GPU through a graphics API and trick it into general-purpose computing by casting problems as graphics:
- turn data into images (texture maps)
- turn algorithms into image synthesis (rendering passes)
Promising results, but:
- tough learning curve, particularly for non-graphics experts
- potentially high overhead of an inadequate graphics API
- highly constrained memory layout and access model

Solution: GPU Computing with CUDA
Co-designed hardware and software for direct GPU computing:
- a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device, without the need to map them to a graphics API (available for the GeForce 8 series, the Tesla platform and some Quadro solutions)
- two higher-level mathematical libraries in common use: CUFFT and CUBLAS
- hardware designed to support lightweight driver and runtime layers, for high performance

Hardware Implementation
- The device is implemented as a set of multiprocessors.
- Each multiprocessor has a single-instruction, multiple-data (SIMD) architecture.
- At any given clock cycle, each processor of a multiprocessor executes the same instruction but operates on different data.

Programming Model: A Highly Multi-threaded Coprocessor
- The GPU is viewed as a compute device capable of executing a very high number of threads in parallel.
- The GPU operates as a coprocessor to the main CPU (host): data-parallel, compute-intensive portions of applications running on the host are off-loaded onto the device.
- A portion of an application that is executed many times, but on different data, can be isolated into a function that is executed on the device as many different threads.
- The GPU needs thousands of threads for full efficiency (GPU threads are extremely lightweight and have very little creation overhead).

Application Programming Interface
An extension to the C programming language:
- function type qualifiers to specify whether a function executes on the host or the device
- explicit GPU memory allocation that returns pointers to GPU memory (cudaMalloc(), cudaFree())
- memory can be copied from host to device, device to host and device to device (cudaMemcpy())
- 4 built-in variables: grid and block dimension, block and thread index
- programming model (grid of thread blocks)
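The built-in variables are typically combined into one global index per thread and axis. The following is a minimal host-side sketch in plain C of that arithmetic; the function names global_index and blocks_needed are hypothetical and not part of the CUDA API, and the parameters merely stand in for the block index, block dimension and thread index along one axis.

```c
#include <assert.h>

/* Global index along one axis, as a thread would compute it:
   block index * block dimension + thread index within the block. */
unsigned global_index(unsigned block_idx, unsigned block_dim, unsigned thread_idx)
{
    assert(thread_idx < block_dim);   /* a thread index lies inside its block */
    return block_idx * block_dim + thread_idx;
}

/* Number of blocks needed to cover n elements along one axis.
   Rounds up, so the last block may be only partially used. */
unsigned blocks_needed(unsigned n, unsigned block_dim)
{
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}
```

For instance, under these assumptions thread 5 of block 2 with 16 threads per block handles element 2 * 16 + 5 = 37, and covering 100 elements with blocks of 16 takes 7 blocks.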

CPU C Program

    void add_matrix_cpu(float *a, float *b, float *c, int N)
    {
        int i, j, index;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                index = i + j * N;
                c[index] = a[index] + b[index];
            }
        }
    }

    int main()
    {
        ...
        add_matrix_cpu(a, b, c, N);
        ...
    }

CUDA C Program

    __global__ void add_matrix_gpu(float *a, float *b, float *c, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int index = i + j * N;
        if (i < N && j < N)
            c[index] = a[index] + b[index];
    }

    int main()
    {
        ...
        dim3 dimBlock(blocksize, blocksize);
        /* integer division: covers the whole matrix only if N is a multiple of blocksize */
        dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
        add_matrix_gpu<<<dimGrid, dimBlock>>>(a, b, c, N);
        ...
    }
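To make the mapping from the grid of thread blocks to matrix elements concrete, the launch can be emulated serially on the host. This is only a sketch in plain C, not how the device schedules threads: add_matrix_thread and add_matrix_emulated are hypothetical names, square blocks are assumed, and the grid size is rounded up (an assumption that differs from the truncating division on the slide) so that matrices whose size is not a multiple of the block size are still covered.

```c
#include <assert.h>

/* Work of one emulated thread; mirrors the body of add_matrix_gpu.
   The block and thread coordinates are passed in explicitly because
   plain C has no built-in blockIdx/threadIdx. */
void add_matrix_thread(const float *a, const float *b, float *c, int N,
                       int block_x, int block_y, int block_dim,
                       int thread_x, int thread_y)
{
    int i = block_x * block_dim + thread_x;
    int j = block_y * block_dim + thread_y;
    if (i < N && j < N) {              /* surplus threads past the edge do nothing */
        int index = i + j * N;
        c[index] = a[index] + b[index];
    }
}

/* Serial emulation of a 2D launch with block_dim x block_dim threads per block.
   Returns the total number of emulated threads. */
int add_matrix_emulated(const float *a, const float *b, float *c,
                        int N, int block_dim)
{
    assert(N > 0 && block_dim > 0);
    int grid_dim = (N + block_dim - 1) / block_dim;   /* blocks per axis, rounded up */
    int launched = 0;
    for (int by = 0; by < grid_dim; by++)
        for (int bx = 0; bx < grid_dim; bx++)
            for (int ty = 0; ty < block_dim; ty++)
                for (int tx = 0; tx < block_dim; tx++) {
                    add_matrix_thread(a, b, c, N, bx, by, block_dim, tx, ty);
                    launched++;
                }
    return launched;
}
```

With N = 5 and a block size of 2, truncating division would give 2 blocks per axis and leave the last row and column untouched. Rounding up gives 3 blocks per axis (36 emulated threads for 25 elements), and the bounds check discards the 11 surplus threads.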

CUDA Software Development Kit

CUDA vs. Standard CPU Performance (2007)

GPU Computing with CUDA
THE END
References:
- SUPERCOMPUTING 2007 Tutorial: High Performance Computing with CUDA, http://www.gpgpu.org/sc2007/
- CUDA Programming Guide 1.1, http://developer.download.nvidia.com/compute/cuda/1_1/nvidia_cuda_programming_guide_1.1.pdf