Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Similar documents
Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

Real-Time Graphics Architecture. Kurt Akeley Pat Hanrahan. Texture

From Shader Code to a Teraflop: How Shader Cores Work

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

Perspective Projection and Texture Mapping

Texture. Real-Time Graphics Architecture. Kurt Akeley Pat Hanrahan.

Real-Time Rendering Architectures

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

GPU Architecture. Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)

GeForce4. John Montrym Henry Moreton

Lecture 4: Geometry Processing. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

CS427 Multicore Architecture and Parallel Computing

CS GPU and GPGPU Programming Lecture 3: GPU Architecture 2. Markus Hadwiger, KAUST

Graphics Processing Unit Architecture (GPU Arch)

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Addressing the Memory Wall

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Scheduling the Graphics Pipeline on a GPU

Intro to Shading Languages

Lecture 23: Domain-Specific Parallel Programming

The Rasterization Pipeline

CS GPU and GPGPU Programming Lecture 3: GPU Architecture 2. Markus Hadwiger, KAUST

Performance Analysis and Culling Algorithms

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Windowing System on a 3D Pipeline. February 2005

CS GPU and GPGPU Programming Lecture 16+17: GPU Texturing 1+2. Markus Hadwiger, KAUST

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CS GPU and GPGPU Programming Lecture 12: GPU Texturing 1. Markus Hadwiger, KAUST

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Computergrafik. Matthias Zwicker Universität Bern Herbst 2016

Course Recap + 3D Graphics on Mobile GPUs

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

Texture mapping. Computer Graphics CSE 167 Lecture 9

Monday Morning. Graphics Hardware

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

CS GPU and GPGPU Programming Lecture 11: GPU Texturing 1. Markus Hadwiger, KAUST

Real-Time Buffer Compression. Michael Doggett Department of Computer Science Lund university

Programming Graphics Hardware

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Texture Caches. Michael Doggett Department of Computer Science Lund University Sweden mike (at) cs.lth.se

Mattan Erez. The University of Texas at Austin

Today. Texture mapping in OpenGL. Texture mapping. Basic shaders for texturing. Today. Computergrafik

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.

Parallel Triangle Rendering on a Modern GPU

GCN Performance Tweets AMD Developer Relations

Lecture 13: Reyes Architecture and Implementation. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

GPU Architecture. Michael Doggett Department of Computer Science Lund university

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Working with Metal Overview

Evolution of GPUs Chris Seitz

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Real-World Applications of Computer Arithmetic

Hardware-driven visibility culling

Efficient and Scalable Shading for Many Lights

Streaming Massive Environments From Zero to 200MPH

POWERVR MBX. Technology Overview

Texture Caching. Héctor Antonio Villa Martínez Universidad de Sonora

A Reconfigurable Architecture for Load-Balanced Rendering

The Rasterization Pipeline

CS 354R: Computer Game Technology

Portland State University ECE 588/688. Graphics Processors

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Morphological: Sub-pixel Morhpological Anti-Aliasing [Jimenez 11] Fast AproXimatte Anti Aliasing [Lottes 09]

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Hardware Accelerated Volume Visualization. Leonid I. Dimitrov & Milos Sramek GMI Austrian Academy of Sciences

PowerVR Series5. Architecture Guide for Developers

Lecture 25: Board Notes: Threads and GPUs

Mattan Erez. The University of Texas at Austin

Enhancing Traditional Rasterization Graphics with Ray Tracing. March 2015

MSAA- Based Coarse Shading

GPU Target Applications

Spring 2009 Prof. Hyesoon Kim

Render-To-Texture Caching. D. Sim Dietrich Jr.

Texturing Theory. Overview. All it takes is for the rendered image to look right. -Jim Blinn 11/10/2018

Shaders. Slide credit to Prof. Zwicker

Real-Time Graphics Architecture. Kurt Akeley Pat Hanrahan. Ray Tracing.

Using Virtual Texturing to Handle Massive Texture Data

Lecture 10: Shading Languages. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Spring 2011 Prof. Hyesoon Kim

CSE 167: Introduction to Computer Graphics Lecture #8: Textures. Jürgen P. Schulze, Ph.D. University of California, San Diego Spring Quarter 2016

Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

GPU Architecture & CUDA Programming

Fundamental CUDA Optimization. NVIDIA Corporation

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

The Need for Programmability

TEXTURE MAPPING. DVA338 Computer Graphics Thomas Larsson, Afshin Ameri

Image Processing Tricks in OpenGL. Simon Green NVIDIA Corporation

Lecture 4: Visibility. (coverage and occlusion using rasterization and the Z-buffer) Visual Computing Systems CMU , Fall 2014

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

The Application Stage. The Game Loop, Resource Management and Renderer Design

CUDA OPTIMIZATIONS ISC 2011 Tutorial

Non-Linearly Quantized Moment Shadow Maps

Current Trends in Computer Graphics Hardware

Transcription:

Lecture 6: Texture Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011)

Today: texturing! Texture filtering - Texture access is not just a 2D array lookup ;-) Memory-system implications - Caching - Storage layouts - Prefetching

Last time Rasterizer point-samples coverage (4 samples per pixel shown here) Z-buffer algorithm used to solve for occlusion at these points

Last time One fragment per covered pixel Triangle attributes are interpolated from vertex values Attributes [typically] sampled at pixel centers to generate values for fragment attributes (recall modern rasterizers may produce attribute eqns, not values)

Shading the fragment Let: lightdir = [-1, -1, 1] HLSL shader program: defines behavior of fragment processing stage mytex = sampler mysamp; Texture2D<float3> mytex; float3 lightdir; float4 diffuseshader(float3 norm, float2 uv) { float3 kd; kd = mytex.sample(mysamp, uv); kd *= clamp( dot(lightdir, norm), 0.0, 1.0); return float4(kd, 1.0); } function defined on [0,1] 2 domain: mytex : [0,1] 2 float3 (represented by 2048x2048 image) mysamp defines how to resample function to generate value at (u,v)

Texture coordinates (UV) Red channel = u Green channel = v

Shaded result

Texture space Screen space Texture space

Aliasing due to undersampling No pre-filtering (aliased result) Pre-filtered texture

Aliasing due to undersampling No pre-filtering (aliased result) Pre-filtered texture

Filtering textures Slide credit: Akeley and Hanrahan

Filtering textures...... Actual texture: 700x700 Texture minification Actual texture: 64x64 Texture magnification

Mipmap (L. Williams 83) Level 0 = 128x128 Level 1 = 64x64 Level 2 = 32x32 Level 3 = 16x16 Level 4 = 8x8 Level 5 = 4x4 Level 6 = 2x2 Level 7 = 1x1 Pre-filter texture data

Mipmap (L. Williams 83) v u Williams original proposed layout Mip hierarchy level = d Slide credit: Akeley and Hanrahan

Constant-time filtering mip-map level d+1 texels Bilinear: 3 lerps (3 mul + 6 add) Trilinear: 7 lerps (7 mul + 14 add) mip-map level d texels Slide credit: Akeley and Hanrahan

Computing d Take differences between texture coordinate values of neighboring fragments Screen space Texture space

Computing d Take differences between texture coordinate values of neighboring fragments (u,v)01 L (u,v)00 (u,v)10 L du/dx = u 10 -u 00 du/dy = u 01 -u 00 dv/dx = v 10 -v 00 dv/dy = v 01 -v 00 mip-map d = log2(l)

GPUs shade quad fragments (2x2 fragment blocks) Cheap, simple texture coordinate differentials + avoids communication

Multiple fragments shaded for pixels at triangle boundaries Shading computations per pixel 8 + 7 6 5 4 3 2 1

Quick aside: multi-sample AA (MSAA) Main idea: decouple shading sampling rate from visibility sampling rate Depth buffer: stores depth per sample Color buffer: stores color per sample Resample color buffer to get final image pixel values

Principle of texture thrift [Peachey 90] Given a scene consisting of textured 3D surfaces, the amount of texture information minimally required to render an image of the scene is...

Principle of texture thrift [Peachey 90] Given a scene consisting of textured 3D surfaces, the amount of texture information minimally required to render an image of the scene is proportional to the resolution of the image and is independent of the number of surfaces and the size of the textures.

Filtering with mip mapping summary Extra storage: 1/3 of original texture image For each texture filtering request - Constant filtering cost (independent of d) - Constant # texels accessed (independent of d) Pretty good quality - Assumption: isotropic filtering of texture function - Many improvements to handle anisotropic filtering (exist in current GPUs) - Higher quality, but greater compute and memory bandwidth cost

Texture mechanics For each texture fetch in a shader program: 1. Compute du/dx, du/dy, dv/dx, dv/dy differentials from quad fragment 2. Convert normalized values to texel coordinates 3. Compute d 4. Compute required texels ** 5. Load texture data **** 6. Result = tri-linear filter according to (u,v,d) Lots of math: All modern GPUs have sophisticated fixed-function hardware for texture processing! (note: even Larrabee design had texturing hardware) ** may involve wrap, clamp, etc. of u,v values according to sampling mode configuration **** may require decompression

Texture block diagram Texture request Programmable core (executes fragment shaders) Texture Processor (fixed-function) Texture response (fp32 rgba) Texture data cache GPU DRAM

Consider memory Texture footprint - Modern games: large textures: 10s-100s of MB - Film rendering: GBs to TBs of textures in scene DB Texture bandwidth - 8 texels per tri-linear fetch (assume 4 bytes/texel) - Modern GPU: billions of fragments/sec (NVIDIA GTX 580: ~40 billion/sec) Performant graphics systems need: - Texture caching - Latency hiding solution - Texture compression

Review: the role of caches in CPUs Reduce off-chip bandwidth requirements (caches service requests that would require DRAM access) - Note: Alternatively, you can think about caches as bandwidth amplifiers (data path between cache and ALUs is usually wider than that to DRAM) Reduce latency of data access Convert fine-grained memory requests from processors into large (cache-line sized) requests than can be serviced efficiently by DRAM

Texture caching thought experiment Assume: Row-major raster order Horizontal texels contiguous in memory Cache line = 4 texels v mip-map level d+1 texels u mip-map level d texels

Now rotate triangle on screen Assume: Row-major raster order Horizontal texels contiguous in memory Cache line = 4 texels u mip-map level d+1 texels v mip-map level d texels

4D blocking (texture is 2D array of 2D blocks: robust to triangle orientation) Assume: Row-major raster order 2D blocks of texels contiguous in memory Cache line = 4 texels u mip-map level d+1 texels v mip-map level d texels

Tiled rasterization increases reuse Assume: Blocked raster order 2D blocks of texels contiguous in memory Cache line = 4 texels u mip-map level d+1 texels v mip-map level d texels

6D blocking further reduces conflicts contiguous cache-sized block

Blocked texture formats Render-to-texture challenge: - Frame-buffer had a preferred format - Textures had a preferred format - Costly to convert between the two These days: - Declare usage for buffers at allocation in API - In general, standard blocking schemes across the board

Key metric Unique texel to fragment ratio - Number of unique texels accessed when rendering a scene / number of fragments processed [see Igeny reading for stats: often less than < 1] - Worse case? In reality, caching behavior is good, but not CPU workload good - Design for 90% hits [Montrym & Moreton 95] - GPUs require high memory bandwidth

Memory latency Request texture data. Processor waits for hundreds of cycles. (Very bad) Recall: GPUs will miss the cache a lot more than CPUs (fundamental to the streaming workload) Solution prior to programmable shading: prefetch - Today s reading: Igehy et al. Prefetching in a Texture Cache Architecture Solution in modern programmable GPUs: thread - Next time

Large fragment FIFOs Note: This diagram does not contain a texture cache. See reading for implementation of prefetching with caching. Slide credit: Akeley and Hanrahan

Texture summary Pre-filtering texture data reduces aliasing - Mip-mapping fundamental to texture system design A texture lookup is a lot more than a 2D array access - Significant computational expense, implemented in specialized fixed-function hardware GPU texture caches: - Primarily serve to amplify limited DRAM bandwidth - Not to reduce latency to off-chip memory - Small in size, multi-ported (e.g., need to access 8 texels simultaneously) Tiled rasterization order, tiled texture layout serve to increase cache hits Texture access latency is hidden by prefetching (in the old days) and multi-threading (in modern GPUs) - The design of a modern GPU processing core is influenced heavily by the need to hide texture access latency (next time)

Readings Z. Hakura and a. Gupta, The Design and Analysis of a Cache Architecture for Texture Mapping. ISCA 97 H. Igehy et al., Prefetching in a Texture Cache Architecture. Graphics Hardware 1998