GPU 101. Mike Bailey. Oregon State University. Computer Graphics. gpu101.pptx. April 23, 2017

GPU 101. Mike Bailey, mjb@cs.oregonstate.edu. gpu101.pptx

Why do we care about GPU Programming? A History of GPU Performance vs. CPU Performance. (Figure: GPU performance vs. CPU performance over time. Source: NVIDIA)

How Can You Gain Access to GPU Power? There are three ways:
1. Write a graphics display program (~1985).
2. Write an application that looks like a graphics display program, but uses the fragment shader to do some computation (~2002).
3. Write in OpenCL (or CUDA), which looks like C++ (~2006).

The Core-Score. How can this be?

Why have GPUs Been Outpacing CPUs in Performance? Due to the nature of graphics computations, GPU chips are customized to handle streaming data. Another reason is that GPU chips do not need the significant amount of cache space that occupies much of the real estate on general-purpose CPU chips. The GPU die real estate can then be re-targeted to hold more cores and thus produce more processing power. Another reason is that general CPU chips contain on-chip logic to do branch prediction. This, too, takes up chip die space. Another reason is that general CPU chips contain on-chip logic to process instructions out-of-order if the CPU is blocked and is waiting on something (e.g., a memory fetch). This, too, takes up chip die space. So, which is better, CPU or GPU? It depends on what you are trying to do!

Originally, GPU Devices were very task-specific.

Today's GPU Devices are less task-specific.

Consider the architecture of the NVIDIA Titan Black we have in our rabbit system: 15 Streaming Multiprocessors (SMs) per chip, 192 cores per SM. Wow! 2880 cores per chip? Really?

What is a Core in the GPU Sense? Look closely, and you'll see that NVIDIA really calls these "CUDA Cores." Look even more closely and you'll see that these CUDA Cores have no control logic: they are pure compute units. (The surrounding SM has the control logic.) Other vendors refer to these as "Lanes." You might also think of them as 192-way SIMD.

A Mechanical Equivalent. (Figure: a mechanical analogy labeling the Streaming Multiprocessor, the CUDA Cores, and the Data. Image: http://news.cision.com)

How Many Robots Do You See Here? 12? 72? It depends on what you count as a robot.

A Spec Sheet Example. (Figure: a GPU spec sheet listing the number of Streaming Multiprocessors, with 128 CUDA Cores per SM. Source: Tom's Hardware)

The Bottom Line is This: the Titan Xp has 30 processors per chip, each of which is optimized to do 128-way SIMD. This is an amazing achievement in computing power. But it is obvious that it is difficult to directly compare a CPU with a GPU: they are optimized to do different things. So, let's use the information about the architecture as a way to consider what CPUs should be good at and what GPUs should be good at.
CPU: general-purpose programming; multi-core under user control; irregular data structures; irregular flow control.
GPU: data-parallel programming; little user control; regular data structures; regular flow control.
BTW, the general term in the OpenCL world for an SM is a Compute Unit. The general term in the OpenCL world for a CUDA Core is a Processing Element.

Compute Units and Processing Elements are Arranged in Grids. (Figure: a Device drawn as a grid of Compute Units, and a Compute Unit drawn as a grid of Processing Elements.) A GPU Platform can have one or more Devices. A GPU Device is organized as a grid of Compute Units. Each Compute Unit is organized as a grid of Processing Elements. So in NVIDIA terms, a Titan Xp has 30 Compute Units, and each Compute Unit has 128 Processing Elements, for a total of 3,840 Processing Elements on the chip. That's a lot of compute power!
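As a concrete illustration of this hierarchy, an OpenCL host program can ask a Device how many Compute Units it has. Here is a minimal sketch, assuming we simply want the first platform and its first GPU device (error checking omitted):

#include <stdio.h>
#include <CL/cl.h>

int main( )
{
	cl_platform_id platform;
	cl_device_id   device;

	// take the first platform and its first GPU device (an assumption;
	// a real program would enumerate them and choose)
	clGetPlatformIDs( 1, &platform, NULL );
	clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

	cl_uint numComputeUnits;
	clGetDeviceInfo( device, CL_DEVICE_MAX_COMPUTE_UNITS,
			sizeof(numComputeUnits), &numComputeUnits, NULL );

	// a Titan Xp would report 30 Compute Units here
	fprintf( stderr, "Compute Units = %u\n", numComputeUnits );
	return 0;
}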

Thinking ahead to OpenCL: How can GPUs execute general C code efficiently? Ask them to do what they do best. Unless you have a very intense data-parallel application, don't even think about using GPUs for computing. GPU programs expect you to not just have a few threads, but to have thousands of them! Each thread executes the same program (called the kernel), but operates on a different small piece of the overall data. Thus, you have many, many threads, all waking up at about the same time, all executing the same kernel program, all hoping to work on a small piece of the overall problem. OpenCL has built-in functions so that each thread can figure out which thread number it is, and thus can figure out what part of the overall job it's supposed to do. When a thread gets blocked somehow (a memory access, waiting for information from another thread, etc.), the processor switches to executing another thread.
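Here is a minimal sketch of what such a kernel looks like in OpenCL C. The array names and the element-wise multiply are assumptions chosen just for illustration; the point is that every thread runs the same code and uses its global ID to pick out its own piece of the data:

__kernel void
ArrayMult( __global const float *dA, __global const float *dB, __global float *dC )
{
	int gid = get_global_id( 0 );	// "which thread am I?"
	dC[gid] = dA[gid] * dB[gid];	// work on just my small piece of the problem
}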

So, the Trick is to Break your Problem into Many, Many Small Pieces. Particle Systems are a great example:
1. Have one thread per particle.
2. Put all of the initial parameters into an array in GPU memory.
3. Tell each thread what the current Time is.
4. Each thread then computes its particle's position, color, etc. and writes it into arrays in GPU memory.
5. The CPU program then initiates OpenGL drawing of the information in those arrays.
Note: once set up, the data never leaves GPU memory! A sketch of a per-particle kernel like the one in step 4 follows below. (Image: Ben Weiss)
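Here is a minimal sketch of step 4 as an OpenCL kernel, assuming each particle's position and velocity live in GPU-resident float4 arrays. The array names, the constant gravity, and the simple Euler time step are all assumptions for illustration:

__kernel void
Particle( __global float4 *dPos, __global float4 *dVel, float dT )
{
	const float4 G = (float4)( 0.f, -9.8f, 0.f, 0.f );	// assumed constant gravity
	int gid = get_global_id( 0 );				// this thread's particle number

	float4 p = dPos[gid];
	float4 v = dVel[gid];

	// simple Euler step: each thread advances only its own particle
	float4 pp = p + v*dT + 0.5f*dT*dT*G;
	float4 vp = v + G*dT;

	dPos[gid] = pp;		// results stay in GPU memory,
	dVel[gid] = vp;		// ready for OpenGL to draw them
}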