Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems, CMU, Fall 2014



Review: mechanisms to reduce aliasing in the graphics pipeline
When sampling visibility?
- Supersampling: sample the coverage signal densely (multiple times per pixel)
When sampling shading (appearance)?
- Prefiltering: remove high frequencies from the texture signal prior to shading

Review: operations in a texture fetch
For each texture fetch in a shader program:
1. Compute du/dx, du/dy, dv/dx, dv/dy differentials from the quad fragment
2. Compute mip-map level d (for tri-linear filtering)
3. Convert normalized texture coordinates (u,v) to texel coordinates (tu,tv)
4. Compute required texel addresses **
5. If the texture data in the filter footprint (eight texels for trilinear filtering) is not in cache:
   - Load compressed texel data from DRAM
   - Decompress texel data
6. Perform tri-linear interpolation according to (tu,tv,d) to get the filtered texture sample
A texture fetch involves both data access and a significant amount of computation: all modern GPUs have dedicated fixed-function hardware for texture sampling and texture decompression.
** May involve wrap, clamp, etc. of the texel coordinates according to the sampling mode configuration
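Steps 1-3 above can be sketched in code. This is a simplified model of the common "max of footprint extents" LOD rule, not any particular GPU's exact arithmetic; the function names are illustrative:

```python
import math

def mip_level(dudx, dvdx, dudy, dvdy, tex_width, tex_height):
    # Scale the normalized-coordinate differentials into texel space.
    ddx = math.hypot(dudx * tex_width, dvdx * tex_height)
    ddy = math.hypot(dudy * tex_width, dvdy * tex_height)
    # d = log2 of the larger screen-space footprint, clamped at the base level.
    rho = max(ddx, ddy)
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0

# A fragment whose footprint spans 4 texels horizontally on a 256x256
# texture maps to mip level 2 (each level halves resolution).
d = mip_level(4.0 / 256, 0.0, 0.0, 1.0 / 256, 256, 256)
```

Given d, trilinear filtering fetches the 2x2 texel neighborhoods at levels floor(d) and floor(d)+1, which is where the eight-texel footprint in step 5 comes from.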

Texture data access characteristics
Key metric: the unique texel-to-fragment ratio
- The number of unique texels accessed when rendering a scene divided by the number of fragments processed [see the Igehy reading for stats: often less than 1]
- What is the worst-case ratio? (assuming trilinear filtering)
- How can incorrect computation of the texture mip level (d) affect this?
In practice, caching behavior is good, but not as good as in typical CPU workloads
- [Montrym & Moreton 95] design for 90% hits
- Why? (there is only so much spatial locality)
Implications
- The GPU must provide high memory bandwidth for texture data access
- The GPU must have a solution for hiding memory access latency
- The GPU must reduce its bandwidth requirements using caching and texture compression

Texture compression

Texture compression
Goal: reduce the bandwidth requirements of texture access
Texture is read-only data
- Compression can be performed off-line, so compression algorithms can take significantly longer than decompression (decompression must be fast!)
- Lossy compression schemes are permissible
Design requirements
- Support random texel access (constant-time access to any texel)
- High-performance decompression
- Simple algorithms (low-cost hardware implementation)
- High compression ratio
- High visual quality (lossy is okay, but cannot lose too much!)

Simple scheme: color palette (indexed color)
Lossless (if the image contains a small number of unique colors)
Color palette (eight colors, indices 0-7)
Image encoding in this example: 3 bits per texel + eight RGB values in the palette (8 x 24 bits)
Per-texel indices for the 4x4 example image:
0 1 3 6
0 2 6 7
1 4 6 7
4 5 6 7
What is the compression ratio?
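The slide's question can be answered by working the arithmetic, assuming the 4x4 index grid shown and 24-bit RGB texels:

```python
# Worked answer for the 4x4 example above.
texels = 16
uncompressed_bits = texels * 24              # 24-bit RGB per texel -> 384 bits
compressed_bits = texels * 3 + 8 * 24        # 3-bit indices + 8-entry palette -> 240 bits
ratio = uncompressed_bits / compressed_bits  # 384 / 240 = 1.6

# The fixed palette cost amortizes over larger images: for a megapixel
# image the ratio approaches the per-texel bound of 24/3 = 8:1.
big = 1024 * 1024
ratio_big = (big * 24) / (big * 3 + 8 * 24)
```

So for this tiny example the ratio is only 1.6:1, even though the asymptotic ratio of the scheme is 8:1 — a reminder that per-image side data matters at small block sizes, which motivates the fixed-footprint block schemes that follow.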

Per-block palette
Block-based compression scheme on 4x4 texel blocks
- Idea: there might be many unique colors across an entire image, but all values in any 4x4 texel region can be approximated using only a few unique colors
Per-block palette (e.g., four colors in the palette)
- 12 bytes for the palette (assume 24 bits per RGB color: 8-8-8)
- 2 bits per texel (4 bytes for the per-texel indices)
- 16 bytes total (3x compression on the original data: 16 x 3 = 48 bytes)
Can we do better?

S3TC (called BC1 or DXTC by Direct3D)
Palette of four colors encoded in four bytes:
- Two low-precision base colors: C0 and C1 (2 bytes each: RGB 5-6-5 format)
- The other two colors are computed from the base values: (1/3)C0 + (2/3)C1 and (2/3)C0 + (1/3)C1
Total footprint of a 4x4 texel block: 8 bytes
- 4 bytes for the palette, 4 bytes of color ids (16 texels, 2 bits per texel)
- 4 bpp effective rate, 6:1 compression ratio (fixed ratio: independent of the data values)
S3TC assumption:
- All texels in a 4x4 block lie on a line in RGB color space
Additional mode:
- If C0 <= C1, the third color is (1/2)C0 + (1/2)C1 and the fourth color is transparent black
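The palette expansion can be sketched as follows. This is a simplified decoder for one BC1 block's palette; real hardware may round the interpolated colors slightly differently:

```python
def bc1_palette(c0_565, c1_565):
    """Expand the four-color palette of a BC1 block from its two
    16-bit RGB 5-6-5 base colors (a sketch; rounding is approximate)."""
    def expand(c):
        r = (c >> 11) & 0x1F
        g = (c >> 5) & 0x3F
        b = c & 0x1F
        # Scale the 5/6-bit channels up to 8 bits.
        return (r * 255 // 31, g * 255 // 63, b * 255 // 31)

    p0, p1 = expand(c0_565), expand(c1_565)
    if c0_565 > c1_565:
        # Four-color mode: interpolated colors at 1/3 and 2/3 along the line.
        p2 = tuple((2 * a + b) // 3 for a, b in zip(p0, p1))
        p3 = tuple((a + 2 * b) // 3 for a, b in zip(p0, p1))
    else:
        # Three-color mode: midpoint color plus transparent black.
        p2 = tuple((a + b) // 2 for a, b in zip(p0, p1))
        p3 = (0, 0, 0)  # treated as transparent
    return [p0, p1, p2, p3]

# White and black base colors yield two grays on the line between them.
palette = bc1_palette(0xFFFF, 0x0000)
```

Each texel's 2-bit id then simply indexes this four-entry palette, which is why decompression is cheap enough for fixed-function hardware.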

S3TC artifacts
The scheme cannot interpolate red and blue to get green (here the compressor chose blue and yellow as base colors to minimize overall error). But it works well in practice on real-world images.
[Figure: original image, zoomed detail, and S3TC-compressed result]
Image credit: http://renderingpipeline.com/2012/07/texture-compression/ [Strom et al. 2007]

Y'CbCr color space
Y' = luma: perceived (gamma-corrected) luminance
Cb = blue-yellow deviation from gray
Cr = red-cyan deviation from gray
[Figure: Y', Cb, and Cr channels of an example image, and the conversion matrix from R'G'B' to Y'CbCr]
Image credit: Wikipedia
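The conversion matrix on the slide did not survive transcription. One common form is the full-range BT.601 variant (the one used by JPEG); other standards use slightly different coefficients:

```python
def rgb_to_ycbcr(r, g, b):
    # Full-range BT.601 (JPEG) coefficients; inputs in [0, 255].
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

# Gray has zero chroma deviation: Cb = Cr = 128, the "no deviation" point.
y, cb, cr = rgb_to_ycbcr(128, 128, 128)
```

Note that the Cb and Cr rows each sum to zero, which is exactly the "deviation from gray" property described above.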

Demo Original picture of Kayvon

Demo Color channels downsampled by a factor of 20 in each dimension (400x reduction in number of samples)

Demo Full resolution sampling of luminance

Demo Reconstructed result

Chroma subsampling
Y'CbCr is an efficient representation for storage (and transmission) because Y' can be stored at higher resolution than Cb and Cr without significant loss in perceived visual quality.
4:2:2 representation:
- Store Y' at full resolution
- Store Cb, Cr at full vertical resolution, but only half horizontal resolution
4:2:0 representation:
- Store Y' at full resolution
- Store Cb, Cr at half resolution in both dimensions
[Figure: luma and chroma sample positions for the 4:2:2 and 4:2:0 layouts]
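The storage savings follow directly from counting samples per pixel (a small sketch; the table values just restate the layouts above):

```python
def samples_per_pixel(scheme):
    # (Y', Cb, Cr) samples stored per pixel for each layout.
    layouts = {
        "4:4:4": (1.0, 1.0, 1.0),    # no subsampling
        "4:2:2": (1.0, 0.5, 0.5),    # chroma halved horizontally
        "4:2:0": (1.0, 0.25, 0.25),  # chroma halved in both dimensions
    }
    return sum(layouts[scheme])

# 4:2:0 stores 1.5 samples per pixel instead of 3: a 2x reduction
# before any further compression is applied.
reduction = samples_per_pixel("4:4:4") / samples_per_pixel("4:2:0")
```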

PACKMAN [Strom et al. 2004]
Block-based compression on 2x4 texel blocks
- Idea: vary luminance per texel, but specify color per block
Each block is encoded as:
- A single base color per block (12 bits: RGB 4-4-4)
- A 4-bit index identifying one of 16 predefined luminance modulation tables
- A per-texel 2-bit index into the luminance modulation table (8 x 2 = 16 bits)
- Total block size = 12 + 4 + 16 = 32 bits (6:1 compression ratio)
Decompression:
- texel[i] = base_color + table[table_id][table_index[i]];
[Figure: example codebook for the modulation tables (8 of the 16 tables shown)]
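The one-line decompression rule above can be fleshed out into a sketch of a full block decode. The modulation table values below are illustrative placeholders, not the real PACKMAN codebook:

```python
def decode_packman_block(base_rgb444, table_id, texel_indices, tables):
    """Sketch of PACKMAN decompression for one 2x4 block.
    `tables` is a codebook of 16 luminance modulation tables;
    the values used here are illustrative, not the real codebook."""
    # Expand the 4-bit channels to 8 bits by bit replication.
    base = tuple((c << 4) | c for c in base_rgb444)
    table = tables[table_id]
    texels = []
    for idx in texel_indices:  # 8 texels, one 2-bit index each
        mod = table[idx]
        # Add the same luminance offset to all three channels, clamped.
        texels.append(tuple(min(255, max(0, c + mod)) for c in base))
    return texels

# Illustrative codebook: every table holds small symmetric offsets.
tables = [[-8, -2, 2, 8]] * 16
texels = decode_packman_block((8, 8, 8), 0, [0, 1, 2, 3, 3, 2, 1, 0], tables)
```

Adding one scalar per texel to a per-block color is what keeps the decoder tiny: the whole block decode is a table lookup and an add per texel.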

iPACKMAN (ETC) [Strom et al. 2005]
Improves on the heavily quantized and sparsely represented chrominance of PACKMAN
- A higher-resolution base color plus a differential color represents color more accurately
Operates on 4x4 texel blocks
- Optionally represent a 4x4 block as two eight-texel subblocks with a base color and differential (else use PACKMAN for the two subblocks)
- 1 bit designates whether the differential scheme is in use
- Base color for the first subblock (RGB 5-5-5: 15 bits)
- Color differential for the second subblock (RGB 3-3-3: 9 bits)
- 1 bit designates whether the subblocks are 4x2 or 2x4
- A 3-bit index identifies the modulation table per subblock (2 x 3 bits)
- Per-texel modulation table indices (2 x 16 bits)
- Total compressed block size: 1 + 15 + 9 + 1 + 6 + 32 = 64 bits (6:1 ratio)
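The differential base-color reconstruction can be sketched as follows, assuming the RGB 3-3-3 differential is a per-channel two's-complement delta applied to the first subblock's RGB 5-5-5 base color:

```python
def etc_subblock_colors(base_555, delta_333):
    """Sketch of iPACKMAN/ETC differential mode: the second subblock's
    base color is the first subblock's color plus a 3-bit signed delta
    per channel (values and checks here are illustrative)."""
    def signed3(v):
        # Interpret 3 bits as a two's-complement value in [-4, 3].
        return v - 8 if v >= 4 else v

    second = []
    for c, d in zip(base_555, delta_333):
        s = c + signed3(d)
        # An encoder avoids this mode when the sum leaves the 5-bit range.
        assert 0 <= s <= 31, "delta overflow: encoder would use PACKMAN mode"
        second.append(s)
    return list(base_555), second

# Channel deltas of -1, 0, +1 applied to a mid-gray base color.
first, second = etc_subblock_colors((16, 16, 16), (7, 0, 1))
```

Spending 15 + 9 bits on two correlated colors, instead of 2 x 12 bits on two independent ones, is what buys the improved chrominance precision over plain PACKMAN.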

PACKMAN vs. iPACKMAN quality comparison
[Figure: original image, PACKMAN result, and iPACKMAN result; annotations mark chrominance banding and a chrominance block artifact]
Image credit: Strom et al. 2005

PVRTC (PowerVR texture compression)
Not a block-based format
- Used in Imagination PowerVR GPUs [Fenney et al. 2003]
Store two low-frequency base images A and B
- Base images are downsampled by a factor of 4 in each dimension (1/16 as many texels)
- Base image pixels are stored in RGB 5-5-5 format (+ 1 alpha bit)
Store a 2-bit modulation factor per texel
Total footprint: 4 bpp (6:1 ratio)

PVRTC [Fenney et al. 2003] Decompression algorithm: - Bilinear interpolate samples from A and B (upsample) to get value at desired texel - Interpolate upsampled values according to 2-bit modulation factor
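The per-texel reconstruction can be sketched as a blend of the two upsampled base images. The blend weights below follow the commonly documented 4 bpp modulation values (0, 3/8, 5/8, 1); the real format also has a punch-through modulation mode, omitted here:

```python
def pvrtc_texel(a_upsampled, b_upsampled, modulation):
    """Sketch of PVRTC reconstruction for one texel: blend the
    bilinearly-upsampled values of base images A and B by the
    texel's 2-bit modulation factor."""
    weights = {0: 0.0, 1: 3.0 / 8, 2: 5.0 / 8, 3: 1.0}
    w = weights[modulation]
    return tuple((1 - w) * a + w * b for a, b in zip(a_upsampled, b_upsampled))

# Modulation 0 returns image A's value; modulation 3 returns image B's.
texel_b = pvrtc_texel((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 3)
texel_a = pvrtc_texel((0.25, 0.5, 0.75), (1.0, 1.0, 1.0), 0)
```

Because every texel's value depends on smoothly upsampled base images rather than on an isolated block, the hard block boundaries of S3TC-style formats never appear.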

PVRTC avoids blocking artifacts
Because it is not block-based! Recall: the decompression algorithm involves bilinear upsampling of low-resolution base images, followed by a weighted combination of the two images.
[Figure: PVRTC-compressed result]
Image credit: Fenney et al. 2003

Summary: texture compression
Many schemes target a 6:1 fixed compression ratio (4 bpp)
- Predictable performance
- 8 bytes per 4x4-texel block is desirable for memory transfers
Lossy compression techniques
- Exploit characteristics of the human visual system to minimize perceived error
- Texture data is read-only, so drift due to multiple reads/writes is not a concern
Block-based vs. non-block-based
- Block-based: S3TC/DXTC/BC1, iPACKMAN/ETC/ETC2, ASTC (not discussed today)
- Non-block-based: PVRTC
We only discussed decompression today:
- Compression can be performed off-line (except when textures are generated at runtime, e.g., reflectance maps)

Hiding the latency of texture sampling and texel data access

Texture sampling is a high-latency operation
For each texture fetch in a shader program:
1. Compute du/dx, du/dy, dv/dx, dv/dy differentials from the quad fragment
2. Compute mip-map level d (for tri-linear filtering)
3. Convert normalized texture coordinates (u,v) to texel coordinates (tu,tv)
4. Compute required texel addresses
5. If the texture data in the filter footprint is not in cache (recall: GPUs miss the cache often):
   - Fetch texel data from DRAM
   - Decompress texel data for storage in the texture cache
6. Perform tri-linear interpolation according to (tu,tv,d) to get the filtered texture sample
The latency of a texture fetch includes the time to perform the math for texel address computation, decompression, and filtering (not just the latency of fetching data from memory)

Addressing texture sampling latency
The processor requests filtered texture data, then waits hundreds of cycles (a significant loss of performance)
Solution prior to programmable GPU cores: texture data prefetching
- Today's reading: Igehy et al., Prefetching in a Texture Cache Architecture
Solution in all modern GPUs: multi-threaded processor cores
- Omitted today, but discussed in detail in a future lecture

Prefetching example: large fragment FIFOs
[Figure: texture prefetching pipeline (from Igehy 1998). The rasterizer emits texel addresses, which are checked against the texel cache tags (texel ids); misses enter a memory request FIFO to the memory system, and responses pass through a memory reorder buffer into the texel cache data store. Meanwhile, fragments (coverage, Z, attributes) wait in a fragment FIFO, alongside their cache addresses, until their texel data is ready for texture filtering.]
Note: the fragment FIFO must be large! Why?
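Why the fragment FIFO must be large follows from Little's Law: the fragments in flight must cover the full memory latency at the rasterizer's issue rate, or the filter stage stalls. A back-of-the-envelope sketch (the cycle counts are illustrative, not from the Igehy paper):

```python
# Little's Law sketch: in-flight work = throughput x latency.
memory_latency_cycles = 300  # illustrative DRAM round-trip time
fragments_per_cycle = 1      # illustrative rasterizer output rate

# The FIFO must hold at least this many fragments (each with its
# coverage, Z, and attribute data) so that by the time a fragment
# reaches the filter, its texel requests have returned from DRAM.
min_fifo_depth = memory_latency_cycles * fragments_per_cycle
```

This is the design's cost: latency is hidden, but only by buffering hundreds of fragments' worth of state in dedicated FIFO storage.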

A more modern design
[Figure: the programmable GPU core places texture requests (u, v, du, dv, lod) into a texture request FIFO feeding a dedicated texture sampling unit. The unit performs texel address computation, checks the texel cache tags, issues misses through a memory request FIFO to the memory system, reorders responses in a memory reorder buffer into the texel cache data store, performs texel filtering, and returns the filtered texture result (RGBA) to the core.]

Modern GPUs: texture latency is hidden via hardware multi-threading
[Figure: a multi-threaded GPU core holding many execution contexts (contexts 0 through 63) issues texture requests (u, v, du, dv, lod) to the texture sampling unit, which fetches texel data from the memory system and returns filtered texture results (RGBA).]
The GPU executes instructions from runnable fragments while other fragments are waiting on texture sampling responses. The fragment FIFO from the Igehy prefetching design is now represented by live fragment state in the programmable core.
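The effect of context count on latency hiding can be illustrated with a deliberately simple model (all numbers illustrative; real scheduling is far more complex):

```python
def cycles_to_finish(num_contexts, compute_cycles, texture_latency):
    """Toy model: each context runs `compute_cycles` of ALU work and
    issues one texture fetch with `texture_latency` cycles of latency.
    With enough contexts the core stays compute-bound; with too few,
    total time is dominated by waiting on the fetch."""
    total_compute = num_contexts * compute_cycles
    one_context_span = compute_cycles + texture_latency
    return max(total_compute, one_context_span)

# 4 contexts x 10 compute cycles cannot hide a 100-cycle fetch: the
# core idles, and runtime is set by the fetch latency (110 cycles).
stalled = cycles_to_finish(4, 10, 100)
# 64 contexts supply enough independent work to hide the fetch entirely:
# runtime is pure compute (640 cycles), with the latency fully overlapped.
busy = cycles_to_finish(64, 10, 100)
```

This is why GPU cores keep tens of execution contexts resident: context state has replaced the prefetching FIFO as the storage that buys latency tolerance.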

Texture system summary
A texture lookup is a lot more than a 2D array access
- Significant computational and bandwidth expense
- Implemented in specialized fixed-function hardware
Bandwidth reduction mechanism: GPU texture caches
- Primarily serve to amplify limited DRAM bandwidth, not to reduce latency to off-chip memory
- Small capacity compared to CPU caches, but high bandwidth (need eight texels at once)
- Tiled rasterization order + tiled texture layout optimizations increase cache hits
Bandwidth reduction mechanism: texture compression
- Lossy compression schemes
- Fixed-compression-ratio encodings (e.g., a 6:1 ratio, 4 bpp, is common for RGB data)
- Schemes permit random access into the compressed representation
Latency avoidance/hiding mechanisms:
- Prefetching (in the old days)
- Multi-threading (in modern GPUs)