GPU Memory Model Overview

John Owens, University of California, Davis. Department of Electrical and Computer Engineering; Institute for Data Analysis and Visualization; SciDAC Institute for Ultrascale Visualization.

Memory Hierarchy: CPU and GPU
- CPU hierarchy: Disk, CPU Main Memory, CPU Caches, CPU Registers
- GPU hierarchy: GPU Video Memory, GPU Caches, GPU Constant Registers, GPU Temporary Registers

CPU Memory Model
At any program point:
- Allocate/free local or global memory
- Random memory access
- Registers: read/write
- Local memory: read/write to stack
- Global memory: read/write to heap
- Disk: read/write to disk
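
To make the contrast with the GPU model concrete, here is a minimal C++ sketch (not from the slides) of that flexibility: allocation, random access, and disk I/O are all allowed at any point in the program. The file name is a made-up placeholder.

    #include <cstdio>
    #include <vector>

    int main() {
        int local = 42;                          // registers / stack: read-write
        std::vector<float> heap(1 << 20);        // heap: allocate whenever we like
        heap[12345] = static_cast<float>(local); // random-access write

        std::FILE* f = std::fopen("scratch.bin", "wb");  // disk: read/write any time
        if (f) {
            std::fwrite(heap.data(), sizeof(float), heap.size(), f);
            std::fclose(f);
        }
        heap.clear();                            // free memory mid-computation
        return 0;
    }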

Cell SPU Memory Model
- 128 local registers, each 128 bits wide
- 256 KB local store, 6-cycle access time
- Explicit, asynchronous DMA access to main memory (allows communication/computation overlap)
- No explicit instruction or data cache
- No disk access
(Figure: http://www.realworldtech.com/includes/images/articles/cell-1.gif)

GPU Memory Model
Much more restricted memory access:
- Allocate/free memory only before computation
- Limited memory access during computation (kernel)
- Registers: read/write
- Local memory: does not exist
- Global memory: read-only during computation; write-only at the end of computation (to a precomputed address)
- Disk access: does not exist

GPU Memory Model
GPUs support many types of memory objects in hardware:
- 1D, 2D, and 3D grids; 2D is most common (framebuffer, texture)
- 2D cube maps (six faces of a cube)
- Mipmapped (prefiltered) versions
- DX10 adds arrayed datatypes
Each native datatype has pros and cons from a general-purpose programming perspective.
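
As a rough illustration (not part of the slides), these object types map onto OpenGL texture targets roughly as follows. The sketch assumes a current GL 3.x context and a loader providing the gl* entry points; sizes are arbitrary placeholders.

    GLuint tex[4];
    glGenTextures(4, tex);

    glBindTexture(GL_TEXTURE_1D, tex[0]);          // 1D grid
    glTexImage1D(GL_TEXTURE_1D, 0, GL_RGBA8, 256, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

    glBindTexture(GL_TEXTURE_2D, tex[1]);          // 2D grid (the common case)
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 256, 256, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glGenerateMipmap(GL_TEXTURE_2D);               // mipmapped (prefiltered) chain

    glBindTexture(GL_TEXTURE_3D, tex[2]);          // 3D grid
    glTexImage3D(GL_TEXTURE_3D, 0, GL_RGBA8, 64, 64, 64, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

    glBindTexture(GL_TEXTURE_CUBE_MAP, tex[3]);    // cube map: six 2D faces
    for (int face = 0; face < 6; ++face)
        glTexImage2D(GL_TEXTURE_CUBE_MAP_POSITIVE_X + face, 0, GL_RGBA8,
                     256, 256, 0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);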

Traditional GPU Pipeline
Inputs: vertex data, texture data. Output: framebuffer.
(Slide diagram: Vertex Buffer -> Vertex -> Rasterizer -> Fragment -> Frame Buffer(s), with Texture feeding the fragment stage)

GPU Memory Model (DX9)
Extending memory functionality:
- Copy from framebuffer to texture
- Texture reads from the vertex processor (VS 3.0 GPUs)
- Render to vertex buffer
(Slide diagram: Texture, Vertex Buffer -> Vertex -> Rasterizer -> Fragment -> Frame Buffer(s))

GPU Memory Model (DX10, traditional)
More flexible memory handling:
- All programmable units can read texture
- Stream out after the geometry processor
(Slide diagram: Arrayed Texture, Vertex Buffer -> Vertex -> Geometry -> Stream Out / Rasterizer -> Fragment -> Arrayed Frame Buffer(s))

GPU Memory Model (DX10, new)
DX10 provides resources, and resources are flexible.
(Slide diagram: DX10 Resources feeding Vertex -> Geometry -> Rasterizer -> Fragment)

GPU Memory API
Each GPU memory type supports a subset of the following operations, split between a CPU interface and a GPU interface.

GPU Memory API: CPU Interface
- Allocate
- Free
- Copy CPU -> GPU
- Copy GPU -> CPU
- Copy GPU -> GPU
- Bind for read-only vertex stream access
- Bind for read-only random access
- Bind for write-only framebuffer access
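
One possible mapping of these CPU-side operations onto classic OpenGL calls for a 2D texture is sketched below; this is an illustration, not the API named by the slides. It assumes a current GL context; `w`, `h`, `pixels`, and `vbo` are placeholder sizes and objects.

    const int w = 256, h = 256;
    static unsigned char pixels[w * h * 4];                   // placeholder host data

    GLuint tex;
    glGenTextures(1, &tex);                                   // Allocate
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);          // Copy CPU -> GPU
    glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA,
                  GL_UNSIGNED_BYTE, pixels);                  // Copy GPU -> CPU
    glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, w, h);  // Copy GPU -> GPU
                                                              //  (framebuffer -> texture)
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);                       // Bind for read-only
                                                              //  vertex stream access
    glBindTexture(GL_TEXTURE_2D, tex);                        // Bind for read-only
                                                              //  random access
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, tex, 0);            // Bind for write-only
                                                              //  framebuffer access
                                                              //  (assumes a bound FBO)
    glDeleteTextures(1, &tex);                                // Free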

GPU Memory API: GPU (Shader/Kernel) Interface
- Random-access read
- Stream read
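
To illustrate the difference (again, not from the slides), here is a pair of GLSL shaders held in C++ string constants: the vertex shader consumes a stream read (one attribute element per vertex), while the fragment shader performs random-access reads (texture fetches at arbitrary coordinates).

    const char* kVertexSrc = R"(
        #version 130
        in vec4 position;              // stream read: one element per vertex
        void main() { gl_Position = position; }
    )";

    const char* kFragmentSrc = R"(
        #version 130
        uniform sampler2D data;        // bound read-only
        out vec4 color;
        void main() {
            // random-access read: any texel, any number of times
            color = texture(data, vec2(0.25, 0.75)) + texture(data, vec2(0.9, 0.1));
        }
    )";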

DX10 View of Memory: Resources
(Resource types: buffers, textures, views)
- Resources encompass buffers and textures
- Retained state is stored in resources
- Must be bound by the API to pipeline stages before use
- The same subresource cannot be bound for both read and write simultaneously

DX10 View of Memory: Buffers
- A collection of elements
- Few requirements on type or format (heterogeneous)
- Elements have 1-4 components (e.g. R8G8B8A8, 8-bit int, 4x32-bit float)
- No filtering, subresourcing, or multisampling
- Layout is effectively linear (casting is possible)
- Examples: vertex buffers, index buffers, constant buffers
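
For concreteness, a minimal sketch of creating such a buffer through the Direct3D 10 API follows; it is not from the slides. `device` is assumed to be an already-created ID3D10Device*, and the vertex data is a placeholder.

    const float vertices[] = { 0, 0, 0,  1, 0, 0,  0, 1, 0 };  // placeholder data

    D3D10_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof(vertices);
    desc.Usage = D3D10_USAGE_DEFAULT;
    desc.BindFlags = D3D10_BIND_VERTEX_BUFFER;   // could also be INDEX_BUFFER,
                                                 // CONSTANT_BUFFER, ...
    D3D10_SUBRESOURCE_DATA init = {};
    init.pSysMem = vertices;                     // copied CPU -> GPU at creation

    ID3D10Buffer* buffer = nullptr;
    HRESULT hr = device->CreateBuffer(&desc, &init, &buffer);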

DX10 View of Memory: Textures
- A collection of texels
- Can be filtered, subresourced, arrayed, and mipmapped
- Unlike buffers, must be declared with a texel type; the type impacts filtering
- Layout is opaque, which enables memory layout optimization
- Examples: texture{1,2,3}d, mipmapped textures, cubemaps

DX10 View of Memory: Views
- A mechanism for hardware interpretation of a resource in memory
- Allows structured access to subresources
- Restricting a view may increase efficiency
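
Putting textures and views together, here is a hedged Direct3D 10 sketch (not from the slides): create a texture resource, then create the shader-resource view that is actually bound to a pipeline stage. `device` is assumed to be an existing ID3D10Device*.

    D3D10_TEXTURE2D_DESC td = {};
    td.Width = 256;
    td.Height = 256;
    td.MipLevels = 1;
    td.ArraySize = 1;                              // >1 would make it arrayed
    td.Format = DXGI_FORMAT_R8G8B8A8_UNORM;        // textures must declare a texel type
    td.SampleDesc.Count = 1;
    td.Usage = D3D10_USAGE_DEFAULT;
    td.BindFlags = D3D10_BIND_SHADER_RESOURCE;

    ID3D10Texture2D* tex = nullptr;
    device->CreateTexture2D(&td, nullptr, &tex);

    // The view, not the resource, is what gets bound to a pipeline stage.
    ID3D10ShaderResourceView* srv = nullptr;
    device->CreateShaderResourceView(tex, nullptr, &srv);
    device->PSSetShaderResources(0, 1, &srv);      // bind for read in the pixel shader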

Big Picture: GPU Memory Model
GPUs are a mix of:
- Historical, fixed-function capabilities: known access patterns and behaviors, accelerated by special-purpose hardware
- Newer, flexible, programmable capabilities: unknown access patterns, where generality is good
The memory model must account for both. The consequences: ample special-purpose functionality, and restricting flexibility may improve performance.

DX10 Bind Rules
- Input Assembler: buffers
- Shader Resource Input: anything, but can only bind views
- Shader Constants: must be created as a shader constant; can't be used in other views
- Depth/Stencil Output: not buffers or texture3d; can only bind views of other resources
- Stream Out: buffers
- Render Target Output: anything, but can only bind views
(Slide diagram: Vertex -> Geometry -> Rasterizer -> Fragment)

Example: Texture
- Texture mapping is a fundamental primitive in GPUs
- Most typical use: a 2D texture map, bound for read only, accessed randomly
- Hardware-supported caching and filtering
(Slide diagram: Texture, Vertex Buffer -> Vertex -> Rasterizer -> Fragment -> Frame Buffer(s))
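
As a small illustration of that typical use (not from the slides), the OpenGL calls below enable the hardware filtering for a previously allocated texture object `tex`; a current GL context is assumed.

    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    // In the shader, any fragment may now sample any texel (random access),
    // with caching and trilinear filtering done by the texture hardware.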

Example: Framebuffer
- Memory written by the fragment processor
- Write-only GPU memory from the shader's point of view; the framebuffer is read-modify-write by the pipeline as a whole
- Displayed to the screen
- Can also store GPGPU results (not just color)
(Slide diagram: Vertex Buffer -> Vertex -> Rasterizer -> Fragment -> Frame Buffer(s))

Example: Render to Texture
- Very common in both graphics and GPGPU; allows multipass algorithms
- Pass 1: write data into the framebuffer
- Pass 2: bind it as a texture and read from the texture
- Can store up to 32 32-bit floating-point values per pixel
(Slide diagram: Texture, Vertex Buffer -> Vertex -> Rasterizer -> Fragment -> Frame Buffer(s), with the framebuffer fed back as a texture)
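
A minimal sketch of this pattern with an OpenGL framebuffer object follows (an illustration, not the slides' method). It assumes a current GL context and a w x h RGBA texture `tex` already allocated with glTexImage2D.

    GLuint fbo;
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, tex, 0);

    // Pass 1: the texture is the write-only framebuffer.
    glViewport(0, 0, w, h);
    // ... draw the first pass here ...

    // Pass 2: unbind it as a render target, rebind it as a read-only texture.
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    glBindTexture(GL_TEXTURE_2D, tex);
    // ... draw the second pass, sampling `tex` in the fragment shader ...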

Example: Render to Vertex Array
- Enables a top-of-pipe feedback loop
- Enables dynamic creation of geometry on the GPU
(Slide diagram: Vertex Buffer -> Vertex -> Rasterizer -> Fragment -> Frame Buffer(s), with the framebuffer fed back to the vertex buffer)
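
One way this feedback loop was commonly implemented in OpenGL (a sketch, not necessarily the slides' mechanism) is to read the framebuffer into a buffer object on the GPU, with no CPU round trip, and then rebind that buffer as a vertex array. Assumes a current GL context with pixel buffer object support; `w` and `h` are placeholders.

    GLuint buf;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, buf);
    glBufferData(GL_PIXEL_PACK_BUFFER, w * h * 4 * sizeof(float),
                 nullptr, GL_DYNAMIC_COPY);

    // Pass 1: copy the rendered framebuffer into the buffer (GPU -> GPU).
    glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, 0);   // 0 = offset into the bound PBO
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    // Pass 2: feed the same memory back in at the top of the pipe.
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 0, (void*)0);
    glEnableVertexAttribArray(0);
    // ... draw using the GPU-generated vertices ...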

Example: Stream Out to Vertex Buffer
- Enabled by the DX10 StreamOut capability; expected to be used for dynamic geometry
- Recall that the geometry processor produces 0 to n outputs per input
- Possible graphics applications: expanding point sprites, extruding silhouettes, extruding prisms/tets
(Slide diagram: Vertex Buffer -> Vertex -> Geometry -> Stream Out, fed back to the vertex buffer)
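
A hedged sketch of the Direct3D 10 plumbing for this (not from the slides): bind a buffer as the stream-out target, draw so the geometry shader's output is captured, then draw again from the captured buffer. `device` and `soBuffer` are assumed to exist, with `soBuffer` created using D3D10_BIND_STREAM_OUTPUT | D3D10_BIND_VERTEX_BUFFER and the geometry shader created with CreateGeometryShaderWithStreamOutput.

    UINT vertexCount = 1024;                      // placeholder input size
    UINT offset = 0;
    device->SOSetTargets(1, &soBuffer, &offset);
    device->Draw(vertexCount, 0);                 // geometry shader output streams
                                                  // into soBuffer

    ID3D10Buffer* none = nullptr;
    device->SOSetTargets(1, &none, &offset);      // unbind the stream-out target

    UINT stride = sizeof(float) * 4, vbOffset = 0;
    device->IASetVertexBuffers(0, 1, &soBuffer, &stride, &vbOffset);
    device->DrawAuto();                           // draw the dynamically created geometry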

Summary
- Rich set of hardware primitives, designed for special-purpose tasks but often useful for general-purpose ones
- Memory usage is generally more restrictive than on other processors
- Becoming more general-purpose and orthogonal
- Restricting generality allows hardware and software to cooperate for higher performance