NVIDIA's Compute Unified Device Architecture (CUDA)


NVIDIA's Compute Unified Device Architecture (CUDA) Mike Bailey mjb@cs.oregonstate.edu

Reaching the Promised Land [Figure: a chart plotting Speed against General Programmability. NVIDIA GPUs with CUDA sit high on Speed, Intel CPUs sit high on General Programmability, and Intel's Knights Corner aims for both.]

History of GPU Performance vs. CPU Performance [Figure: GFLOPS over time for GPUs vs. CPUs. Source: NVIDIA. Legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

Why Have GPUs Been Outpacing CPUs? Due to the nature of graphics computations, GPU chips are customized to handle streaming data. Because that data is already sequential, GPU chips do not need the large caches that dominate the die real estate of general-purpose CPU chips. The GPU's die real estate can instead hold more cores, and thus deliver more processing power. The other reason is that general-purpose CPU chips contain on-chip logic to process some instructions out of order when the CPU is blocked waiting on something (e.g., a memory fetch). This, too, takes up die space. For example, while Intel and AMD are now shipping CPU chips with 4 cores, NVIDIA is shipping GPU chips with 512. Overall, in four years GPUs have achieved a 17.5-fold increase in performance, a compound annual increase of 2.05x.

How Can You Gain Access to that GPU Power?
1. Write a graphics display program (~1985)
2. Write an application that looks like a graphics display program (~2002)
3. Write in CUDA, which looks like C++ (~2006)

CUDA Architecture
The GPU has some number of MultiProcessors (MPs), depending on the model. The NVIDIA Fermi 480 and above have 16 MPs. An 8000-series MP has 8 independent processors (cores); a Fermi-based MP has 32 independent processors (cores). Memory is divided into Shared Memory and Constant Memory. [Diagram: MPs and their Cores]
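The MP count and memory sizes described above can be queried at run time. A minimal sketch using the CUDA runtime API's cudaGetDeviceProperties (not part of the original slides):

```cuda
// Query the architecture parameters this slide describes: the number of
// MultiProcessors (MPs) and the shared/constant memory sizes.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device:                  %s\n",       prop.name);
    printf("MultiProcessors:         %d\n",       prop.multiProcessorCount);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Constant memory:         %zu bytes\n", prop.totalConstMem);
    return 0;
}
```

Compiled with nvcc, this reports 16 MPs on a Fermi-class GTX 480, matching the figure quoted above.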

The CUDA Paradigm
A C++ program contains both host and CUDA code. [Diagram: the host code and the CUDA code each go through their own compiler and linker, producing a CPU binary that runs on the host and a CUDA binary that runs on the GPU.] CUDA is an NVIDIA-only product, but it is very popular, and it got the GPU-as-CPU ball rolling, which has resulted in other products like OpenCL.

If GPUs Have so Little Cache, How Can They Execute General C++ Code Efficiently?
1. Multiple MultiProcessors
2. Threads: lots and lots of threads
CUDA expects you to have not just a few threads, but thousands of them! All threads execute the same code (called the kernel), but operate on different data. Each thread can figure out which number it is, and thus what its job is. Think of all the threads as living in a pool, waiting to be executed. All processors start by grabbing a thread from the pool. When a thread gets blocked somehow (a memory access, waiting for information from another thread, etc.), the processor quickly returns the thread to the pool and grabs another one to work on. This thread swap happens within a single cycle. (A full memory access requires ~200 instruction cycles to complete.)
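The "each thread can figure out which number it is" step looks like this in practice. A minimal sketch (kernel and variable names are illustrative, not from the slides) in which every thread computes its own global index and works on one data element:

```cuda
#include <cuda_runtime.h>

// Every thread executes this same kernel; each one computes its own
// global index from its block and thread coordinates, and that index
// tells it which element of the data is its job.
__global__ void Scale(float *data, float factor, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)                // guard: the grid may be larger than n
        data[gid] *= factor;
}

int main()
{
    const int N = 1 << 20;      // ~1M elements, so ~1M threads in the pool
    float *d;
    cudaMalloc(&d, N * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    Scale<<<blocks, threadsPerBlock>>>(d, 2.0f, N);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```

Launching with far more threads than cores is the point: while some threads stall on that ~200-cycle memory access, the MPs always have other threads from the pool to run.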

So, the Trick Is to Break Your Problem into Many, Many Small Pieces
Particle systems are a great example:
1. Have one thread per particle.
2. Put all of the initial parameters into an array in GPU memory.
3. Tell each thread what the current Time is.
4. Each thread then computes its particle's position, color, etc. and writes it into arrays in GPU memory.
5. The CPU program then initiates drawing of the information in those arrays.
Note: once set up, the data never leaves GPU memory! (Particle system image: Ben Weiss)
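Steps 1-4 above can be sketched as a kernel. This is an illustrative example, not code from the slides; the physics (simple ballistic motion) and all names are assumptions. One thread updates one particle, and the position arrays stay resident in GPU memory:

```cuda
#include <cuda_runtime.h>

// One thread per particle: each thread reads its particle's initial state
// from GPU-resident arrays, advances it to the current time t, and writes
// the result back into GPU memory. The data never leaves the GPU; the CPU
// only launches the kernel and then initiates drawing.
__global__ void UpdateParticles(const float3 *pos0, const float3 *vel0,
                                float3 *pos, float t, int numParticles)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= numParticles)
        return;

    const float g = -9.8f;          // gravity (illustrative physics)
    float3 p0 = pos0[gid];
    float3 v0 = vel0[gid];

    pos[gid] = make_float3(p0.x + v0.x * t,
                           p0.y + v0.y * t + 0.5f * g * t * t,
                           p0.z + v0.z * t);
}
```

Called once per frame with the current time, the resulting position array can then be handed to the graphics pipeline for drawing without a round trip through the CPU.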