Real-Time Buffer Compression. Michael Doggett Department of Computer Science Lund university

Similar documents
Performance Analysis and Culling Algorithms

Parallel Triangle Rendering on a Modern GPU

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Lecture 4: Visibility. (coverage and occlusion using rasterization and the Z-buffer) Visual Computing Systems CMU , Fall 2014

Spring 2009 Prof. Hyesoon Kim

POWERVR MBX. Technology Overview

GeForce4. John Montrym Henry Moreton

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

Spring 2011 Prof. Hyesoon Kim

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

ECE 571 Advanced Microprocessor-Based Design Lecture 20

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Graphics Processing Unit Architecture (GPU Arch)

ECE 571 Advanced Microprocessor-Based Design Lecture 18

GPU Architecture. Michael Doggett Department of Computer Science Lund university

CS4620/5620: Lecture 14 Pipeline

Addressing the Memory Wall

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Hardware-driven visibility culling

Scanline Rendering 2 1/42

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

Pipeline Operations. CS 4620 Lecture 10

The Rasterization Pipeline

Project 1, 467. (Note: This is not a graphics class. It is ok if your rendering has some flaws, like those gaps in the teapot image above ;-)

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Course Recap + 3D Graphics on Mobile GPUs

ARM Multimedia IP: working together to drive down system power and bandwidth

The NVIDIA GeForce 8800 GPU

ECE 574 Cluster Computing Lecture 16

Line Drawing. Introduction to Computer Graphics Torsten Möller / Mike Phillips. Machiraju/Zhang/Möller

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Efficient and Scalable Shading for Many Lights

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

Real-Time Graphics Architecture

Hardware-driven Visibility Culling Jeong Hyun Kim

LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014)

GPU Architecture. Samuli Laine NVIDIA Research

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

E.Order of Operations

Current Trends in Computer Graphics Hardware

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, DRAM Bandwidth

2.11 Particle Systems

Pipeline Operations. CS 4620 Lecture 14

Lecture 2. Shaders, GLSL and GPGPU

Monday Morning. Graphics Hardware

How to Work on Next Gen Effects Now: Bridging DX10 and DX9. Guennadi Riguer ATI Technologies

The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering. Bill Mark and Kekoa Proudfoot. Stanford University

Mobile Performance Tools and GPU Performance Tuning. Lars M. Bishop, NVIDIA Handheld DevTech Jason Allen, NVIDIA Handheld DevTools

Working with Metal Overview

What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others

1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7.

Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.

GeForce3 OpenGL Performance. John Spitzer

A Bandwidth Effective Rendering Scheme for 3D Texture-based Volume Visualization on GPU

ECE 571 Advanced Microprocessor-Based Design Lecture 21

Morphological: Sub-pixel Morhpological Anti-Aliasing [Jimenez 11] Fast AproXimatte Anti Aliasing [Lottes 09]

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Graphics Hardware. Instructor Stephen J. Guy

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

High-Quality Surface Splatting on Today s GPUs

Pipeline Operations. CS 4620 Lecture Steve Marschner. Cornell CS4620 Spring 2018 Lecture 11

28 SAMPLING. ALIASING AND ANTI-ALIASING

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance

Real-Time Graphics Architecture

Graphics Pipeline 2D Geometric Transformations

CENG 477 Introduction to Computer Graphics. Graphics Hardware and OpenGL

Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane

Point Sample Rendering

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

CS 354R: Computer Game Technology

Optimizing Games for ATI s IMAGEON Aaftab Munshi. 3D Architect ATI Research

Case 1:17-cv SLR Document 1-3 Filed 01/23/17 Page 1 of 33 PageID #: 60 EXHIBIT C

Line Drawing. Foundations of Computer Graphics Torsten Möller

CS 432 Interactive Computer Graphics

Mattan Erez. The University of Texas at Austin

GPUs and GPGPUs. Greg Blanton John T. Lubia

Drawing Fast The Graphics Pipeline

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS427 Multicore Architecture and Parallel Computing

Rendering Grass with Instancing in DirectX* 10

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Introduction to Computer Graphics (CS602) Lecture No 03 Graphics Systems

Memory Hierarchies &

Transcription:

Real-Time Buffer Compression Michael Doggett Department of Computer Science Lund university

Project 3D graphics project Demo, Game Implement 3D graphics algorithm(s) C++/OpenGL(Lab2)/iOS/android/3D engine Proposal - Long paragraph by next Thursday More in the next lecture 2010 Michael Doggett

Stages we have looked at so far Vertex shader Rasterization Pixel shader Texture Z & Alpha FrameBuffer 2010 Michael Doggett

Today s stages of the Graphics Pipeline Vertex shader Rasterization Z and Color Compression Pixel shader Z & Alpha Texture FrameBuffer 2010 Michael Doggett

Z & Alpha Performance Recall Memory Bandwidth determines performance Both units connect directly to memory Computation power of GPUs and CPUs increasing more rapidly than memory bandwidth Compression reduces the data before we transfer it 2012 Michael Doggett

DRAM overview Dynamic Random Access Memory Must have power and be refreshed to maintain it Discrete GPU memory - Fast and reasonably priced Many different types : SDRAM, VRAM, SGRAM, etc. Multiple improvements of data transfer DDR - sends data on both low-to-high and high-to-low QDR, then GDDR (Graphics DDR) versions 2, 3, 4, 5 HBM - High Bandwidth Memory 3D-stacked DRAM mmxvi mcd

PC memory architecture Integrated CPU and GPU Intel Sandy/Ivy Bridge/ Haswell/Skylake AMD Fusion CPU GPU 8GB/s PCIe2 16x GPU NVIDIA GTX980 12.8GB/s 224GB/s GDDR5 System DRAM Graphics DRAM 2012 Michael Doggett

Mobile Memory Architecture CPU Video Cache Phone ~5 GB/s GPU System DRAM 1GB Storage : Flash Memory 16-64GB Based on 2014 iphone5s mmxiii mcd source http://anandtech.com/show/7335/the-iphone-5s-review

Back to Graphics Hardware algorithms

Why depth buffer? [Slide courtesy of John Owens] The brute-force approach is depth buffering (aka Z-buffering): It won over sorting-polygon-methods because memory became ridiculously inexpensive...

Depth buffer bandwidth Still could be quite expensive! Zmin/Zmax-culling helped (previous lecture) Real-Time Buffer Compression can help reduce Depth buffer bandwidth Color buffer bandwidth Other buffers...

Real-Time Buffer Compression Techniques that are or may be used in GPUs... Basic idea: Lots of coherency (correlation) between pixels Use that to compress buffer info Send compressed buffer info over the bus Special hardware handles compression and decompression on-the-fly Must be lossless!!

General Compression System GPU GPU Memory

Compression System Works on a tile basis Eg 8x8 pixels at a time Cache is important! Do not want to decompress tile for every fragment that needs access to values in that tile Tile table store per-tile info : E.g., which compression mode is used Example: 00 is uncompressed, 01 is compressed at 25%, 10 is at 50%, 11 is cleared Always needs one uncompressed mode as a fallback

Example Read request! ctrl block Checks cache If there, deliver immediately If not Evict one tile from cache by attempting to compress info, and sending resulting representation, update tile table for that tile Check tile table for requested tile, and Read appropriate amount of bytes Decompress (or send cleared info without reading, or in case of data being uncompressed, no decompression needed) Done

Dirty bit Each tile in cache has one bit for this When new info has been written to a tile in cache, set dirty bit=1 When a tile in the cache needs to be evicted, check dirty bit If =0, information in external memory is up to date! no need to write back! If =1, attempt to compress, and send to external memory Saves a lot when no updates Example: particle systems do not write depth!

Depth buffer compression Hard to get accurate information about this Looking at patents we can extract some ideas Three techniques: Depth offset compression DPCM compression Layered plane equation compression

Depth buffer compression Simplest buffer to compress Highly coherent info (big triangles WRT tile size) Depth is linear in screen space Depth cache and depth compression helps Zmax update for Zmax-culling Depths, d(i,j) per tile, i is in [0, w-1], j is in [0, h-1] Min depth value is 00...00 b (all zeroes, e.g. 24 bits) Max depth value is 11...11 b (all ones) i.e., we can use integer math

Depth offset compression (1) Identify a set of reference values, r k, and compress each depth as an offset with respect to one reference value Easiest to only use two reference values Use Zmin and Zmax of tile! Rationale: we have two layers One with depths close to Zmin and one with depths close to Zmax

Visually, this means... Can encode if all z-values are in the gray regions 20

Depth offset compression (2) Use an offset range of t=2 p Can use offset, o(i,j), per pixel as: If at least one pixel, (i,j), cannot fulfil the above, the tile cannot be compressed! Info to store (if compression possible): Zmin and Zmax Plus wxh p-bit values

Depth offset compression (3) Example with following assumptions: 8x8 pixels per tile t=2 8 means 8 bits per offset, o 24 bits depth! Zmin and Zmax has 24 bits each Storage (uncompressed: 8x8x3=192 bytes): 1 bit per pixel to indicate whether offset to Zmin or Zmax! 8x8x1 bits= 8 bytes Offsets: 8x8x8 bits= 64 bytes Zmin & Zmax: 6 bytes (might be on-chip though) Total: 8+64+6=78 bytes! 100*78/192=41% compression

Less expensive implementation Only possible to compress if all depths in a tile are in gray regions There are some extensions to this that makes the hardware simpler! Make the offset computation inexpensive! Currently costs an adder in HW

Inexpensive offsets... Instead of storing exactly Zmin and Zmax, store only m most significant bits (MSBs) Call these truncated values, u min and u max Offset is now simply the k-m least significant bits of depth (no add needed) m bits u min k-m bits Offset, o(i,j) k bits (eg 24 bits)

Disadvantage of cheap offsets, and a solution Only values in dark gray area can be coded! loss of compressibility! Simple solution: use one more bit per pixel! four reference values: u min, u min +1, u max -1 and u max

Decompression hardware Very simple

Compression ratio with inexpensive variant Slightly worse! 44% instead of 41% But, range of offsets is larger! Best case: range is twice as large Worst case: range is only one depth value larger Average case: range is about 50% larger! So more tiles can be compressed, but still costs more

DPCM Compression DPCM=differential pulse code modulation Basic idea: we usually have linearly varying values in tile Second derivative of linear function is zero! However, we have discretized function, so need discretized second derivatives

DPCM: Focus on one column of depths d 0 d 0 d 0 d 1 s 1 s 1 d 2 d 3 Compute slopes, s 2 s 3 Compute differential Δs 2 Δs 3 d 4 s i = d i -d i-1 s 4 slopes, Δs i = s i -s i-1 Δs 4 d 5 s 5 Δs 5 d 6 s 6 Δs 6 d 7 s 7 Δs 7 For linear functions, the Δs s will be close to 0 Reconstruction is simple (next slide)

DPCM reconstruction From definition, we get: Only s 1 is known, but:

Process each column independently Not ideal: still many d s and s s Process first two rows similarly!

DPCM: How to store this? One depth, d, two slopes, s, and 61 Δs The Δs are small, in [-1,+1] inside triangle Use two extra bits per pixel: 00: add 0 01: add +1 10: add -1 11: use as escape code to handle extraordinary cases... Best case compression (no escapes at all): 24 bits + 25 +25 +8x8x2 ~= 25 bytes (13% ratio) If a single triangle covers entire tile Do not need the 11-escape case then though...

DPCM: common case Single column: Depths:1,2,3,4,8,10,12,14 Slopes: 1,1,1,4,2,2,2 Diff slopes: 0,0,3,-2,0,0 Two escape codes needed per column to change from one plane eq (tri) to another Becomes expensive! 40% compression ratio Solution: encode from the top & down and from bottom & up Store also where transition happens Gives about 20% compression ratio! Might be possible to use fewer bits per slope Can only handle two plane equations per tile Still does not use escape

Plane Equation Compression Each triangle can be represented as a plane For every triangle in a tile store the triangle s plane equation Store one depth in center of tile, and an x-slope (dz/dx), and y-slope (dy/dz) across the tile For every pixel in the tile store an index to find the matching plane equation Works great for multisample! Random access only decompress necessary pixels More info [VanHook07] US Patent 7,242,400 2010 Michael Doggett

Plane Equation Compression Plane 0 : Zc, x slope, y slope Plane 1 : Zc, x slope, y slope Plane 2 : Zc, x slope, y slope Plane Equations 3 x (3 + 2 + 2)Bytes = 21Bytes Indexes 64 x 2bits = 16Bytes Compressed 37Bytes Uncompressed 64 x 3Bytes = 192Bytes Compression ratio 19% 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 2 0 0 0 1 1 1 2 2 2 0 0 1 1 2 2 2 2 2 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2010 Michael Doggett

Color Buffer Compression Could use offset compression for R, G, and B separately (perhaps) Could use JPG s non-lossy algorithms Can do simple color compression for multisample anti-aliasing Can compress clear color Is generally very difficult due to restrictions Cannot be lossy Must decode very fast for alpha blending

Conclusion Compression reduces bandwidth further Several options for depth For more details read Notes chapter 7 Harder for color Needs cache Needs fallback for non-compressed mode

Next... Today and Friday Assignment 2 marking in Uranus Next lecture Antialiasing Texture Compression Start project Next week : GPU Architecture Graphics Architecture and OpenCL 2010 Michael Doggett