Using Virtual Texturing to Handle Massive Texture Data


Using Virtual Texturing to Handle Massive Texture Data. San Jose Convention Center, Room A1. Tuesday, September 21st, 14:00-14:50. J.M.P. Van Waveren, id Software; Evan Hart, NVIDIA

How do we describe our environment? Today? Tonight? Polygonal boundary representations: a convenient / compressed description of the material world. Tiling / repeating / blending textures: primitive forms of texture compression?

Unique texture detail

Very large textures

Virtual Texture vs. Virtual Memory: fall back to blurrier data without stalling execution; lossy compression is perfectly acceptable.

Universally applied virtual textures

Virtual textures with virtual pages

Physical texture with physical pages

Virtual to Physical Translation


Optimized virtual to physical translations: store the complete quad-tree as a mip-mapped texture (FP32x4); use a mapping texture to store the scale and bias (8:8 + FP32x4); or calculate the scale and bias in a fragment program (8:8:8:8 or 5:6:5).
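The fragment-program variant can be sketched in scalar code. This is a minimal illustration, assuming a square virtual texture, a page table indexed by (mip level, page x, page y), and no page borders; the names are hypothetical, not id's actual scheme:

```python
# Hedged sketch: compute the scale and bias that map normalized virtual
# texture coordinates into a physical page, as a fragment program would.
# Assumes no page borders; real pages shrink the scale to skip border texels.

def page_scale_bias(level, page_x, page_y, phys_x, phys_y,
                    virt_pages, phys_pages):
    """virt_pages: pages per side of the virtual texture at the finest mip.
    phys_pages: pages per side of the physical texture.
    A page at mip `level` covers a 2**level x 2**level region of finest pages."""
    pages_at_level = virt_pages >> level      # pages per side at this mip
    virt_extent = 1.0 / pages_at_level        # normalized size of the covered region
    virt_ox = page_x * virt_extent            # region origin in virtual UV space
    virt_oy = page_y * virt_extent
    phys_extent = 1.0 / phys_pages            # normalized size of one physical page
    scale = phys_extent / virt_extent
    bias_x = phys_x * phys_extent - virt_ox * scale
    bias_y = phys_y * phys_extent - virt_oy * scale
    return scale, (bias_x, bias_y)

def translate(u, v, scale, bias):
    # the per-fragment work: one multiply-add per coordinate
    return u * scale + bias[0], v * scale + bias[1]
```

With the scale and bias in hand, the per-fragment translation is a single multiply-add, which is why computing them on the fly in the fragment program is attractive.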

Texture Filtering: bilinear filtering without borders; bilinear filtering with borders; trilinear filtering (mip-mapped vs. two translations); anisotropic filtering with a 4-texel border (max aniso = 4), using explicit derivatives + TXD (texgrad) or implicit derivatives; works surprisingly well.

Which pages need to be resident? Feedback rendering: a separate rendering pass, or combined with the depth pass; a factor 10 smaller is ok. Feedback analysis: run as a parallel job on the CPU, or run on the GPU; ~0.5 msec on the CPU for 80 x 60.
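The feedback-analysis step reduces to counting which virtual pages the frame touched. A minimal sketch, assuming each feedback-buffer pixel has already been decoded to a (mip, page x, page y) tuple; the encoding itself is an assumption:

```python
# Hedged sketch of feedback analysis: tally page requests from the
# low-resolution feedback buffer and order them by how often they were hit.

def analyze_feedback(buffer):
    """buffer: iterable of (mip, page_x, page_y) tuples, one per feedback pixel."""
    requests = {}
    for pixel in buffer:
        requests[pixel] = requests.get(pixel, 0) + 1   # count touches per page
    # most-requested pages first, so they get streamed/transcoded first
    return sorted(requests, key=requests.get, reverse=True)
```

At 80 x 60 that is only 4800 pixels to scan, which is why the slide's ~0.5 msec CPU figure is plausible even single-threaded.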

How to store huge textures? Diffuse + specular + normal + alpha + power = 10 channels. 128k x 128k x 3 x 8-bit RGBA = 256 GigaBytes; DXT compressed (1 x DXT1 + 2 x DXT5) = 53 GigaBytes. Use brute-force scene visibility to throw away data: down to 20 to 50 GigaBytes uncompressed, 4 to 10 GigaBytes DXT compressed.

Need variable bit rate compression! DCT-based compression: 300 to 800 MB. HD-Photo compression: 170 to 450 MB.

What does this look like per page? 128 x 128 texels per page: 120 x 120 payload + a 4-texel border on all sides. 192 kB uncompressed; 40 kB DXT compressed; 1 to 6 kB DCT-based or HD-Photo compressed.
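The per-page sizes above can be reproduced with a little arithmetic, assuming three 8-bit RGBA source textures per page (matching the 1 x DXT1 + 2 x DXT5 target from the previous slide):

```python
# Reproduce the per-page sizes quoted above. Assumes three RGBA8 source
# textures per page; DXT1 stores a 4x4 block in 8 bytes, DXT5 in 16 bytes.
PAGE = 128                                   # texels per side, border included
uncompressed = PAGE * PAGE * 4 * 3           # 4 bytes/texel, 3 textures
blocks = (PAGE // 4) ** 2                    # DXT works on 4x4 texel blocks
dxt = blocks * (8 + 16 + 16)                 # 1 x DXT1 + 2 x DXT5 per block
print(uncompressed // 1024, dxt // 1024)     # -> 192 40  (kB, as on the slide)
```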

Can't render from variable bit rate, so transcode DCT-based or HD-Photo to DXT. This is a significant computational load: 1 to 2 milliseconds per page on a single CPU core.

Pipeline overview

GPU Transcoding Motivation. The transcode rate is tied to quality / performance: drop frames and the image is lower detail; wait for results and the frame rate degrades. A densely occluded environment may desire in excess of 46 MTex/s; DCT-based transcoding can exceed 20 ms per frame, and HD-Photo transcoding can exceed 50 ms per frame.

Transcoding Analysis: several jobs (pages) per frame. Jobs occur in several stages: entropy decode, dequantization, frequency transform, color space transformation, DXT compression.

Transcoding Pipeline Entropy Decode (~20-25%) Frequency Transform (25-50%) DXT Compression (25-50%)

Transcoding Breakdown: entropy decode, 20-25% of CPU time; dequantization + frequency transform, 25-50% of CPU time; color transform + DXT compression, 25-50% of CPU time.

Transcoding Parallelism: entropy decode is semi-parallel (dozens to hundreds of threads); dequantization + frequency transform is extremely parallel (hundreds to thousands); color transform + DXT compression is extremely parallel (hundreds to thousands).

Entropy Decode: Huffman-based coding schemes with variable bit-width symbols and run-length encoding; serial dependencies in the bit stream; a substantial amount of branching.

Huffman Decode Basics (figure): starting at the current bitstream position, scan bits until a unique prefix is matched, output the symbol plus run-length info, and advance the position.

Huffman GPU Processing: long serial dependencies limit parallelism, and the code is relatively branchy (divergence) with relatively few threads. It can perform reasonably with very many small streams, but that is not the case here; the CPU offers better efficiency today.
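The serial dependency is easy to see in code: the start of each symbol is only known once every preceding symbol has been decoded. A toy sketch with an illustrative four-symbol prefix table (not the actual DCT code tables):

```python
# Toy Huffman decode illustrating the serial dependency described above.
# The code table is a made-up prefix-free example, not a real DCT table.
CODES = {'0': 'A', '10': 'B', '110': 'C', '111': 'D'}

def huffman_decode(bits):
    out, prefix = [], ''
    for b in bits:                     # one bit at a time: inherently serial
        prefix += b
        if prefix in CODES:            # unique prefix matched
            out.append(CODES[prefix])  # emit symbol, position advances
            prefix = ''
    return out
```

Because symbol boundaries are data-dependent, thread N cannot start until thread N-1 has finished, which is exactly why this stage stays on the CPU.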

Frequency Transform: a block-based transform from the frequency domain, the IDCT of macroblocks. Inherently parallel at the block level. Uses an NPP-derived IDCT kernel to batch several blocks into a single CTA; shared memory allows the CTA to efficiently transition from the vertical to the horizontal phase.

Macroblock (figure): the image is broken into 16x16 macroblocks for the DCT, with color encoded as 4:2:0: 16x16 luminance texels plus two 8x8 chrominance blocks; the transform blocks are 8x8.

CUDA IDCT: the 2D IDCT is separable, so an 8x8 block has two stages, each 8-way parallel. That is too little parallelism for a single CTA, and luminance and chrominance blocks may require different quantization, so group 16 blocks into a single CTA and store the blocks in shared memory to enable fast redistribution between the vertical and horizontal phases.
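The separable structure described above (a 1D pass over rows, then a 1D pass over columns) can be sketched in scalar code; the CUDA kernel would run each inner loop across threads and redistribute the data through shared memory between the passes. This is a reference-style IDCT-II, not the optimized kernel from the talk:

```python
import math

N = 8  # JPEG/MPEG-style 8x8 transform blocks

def idct_1d(X):
    # Reference IDCT-II over one row or column of 8 coefficients.
    out = []
    for n in range(N):
        s = 0.0
        for k in range(N):
            c = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            s += c * X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
        out.append(s)
    return out

def idct_2d(block):
    # Separability: 1D pass over rows, transpose, 1D pass over columns.
    rows = [idct_1d(r) for r in block]
    cols = [idct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

Each 1D pass is only 8-way parallel, matching the slide's point that a lone 8x8 block underfills a CTA and that 16 blocks should be batched together.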

IDCT Workload: 64 macroblocks per 128x128 RGB page, 6 blocks per macroblock (4 luminance + 2 chrominance), 8 threads per block: 3072 threads per RGB page. That fills roughly 1/5th of a GTX 480.

DXT Compression: DXT relies on 4x4 blocks, giving 1024 blocks in one 128x128 image, so one thread per block works well. There is finer parallelism, but the communication can cost too much. Careful packing of data is useful to prevent register bloat.

DXT Blocks (figure): a 4x4 texel block is encoded as a max color (16 bits), a min color (16 bits), and 16 2-bit indices (32 bits).

CUDA DXT Compression: all operations are performed on integer colors, which matches the CPU reference implementation and allows packing 4 colors into a 32-bit word for 4x better register utilization. The CTA is aligned to macroblock boundaries, which allows fetching 4:2:0 data into shared memory for efficient memory utilization; presently a 32x32 texel region.
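The per-block work each thread performs can be sketched as follows. This is a simplified DXT1-style encoder for illustration only; a production encoder refines the endpoints, and per the slides the CUDA version additionally packs four colors into each 32-bit register:

```python
# Hedged sketch of one thread's DXT1-style block encode: pick endpoint
# colors, quantize them to 5:6:5, and give each texel a 2-bit palette index.
# Endpoint selection here (by luminance) is a simplification.

def rgb565(r, g, b):
    # quantize an 8-bit RGB color to the 16-bit 5:6:5 endpoint format
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def dxt1_encode(texels):
    """texels: 16 (r, g, b) tuples for one 4x4 block."""
    lum = lambda c: c[0] + c[1] + c[2]
    cmin, cmax = min(texels, key=lum), max(texels, key=lum)
    # 4-color palette: the two endpoints plus two 1/3 blends
    pal = [cmax, cmin,
           tuple((2 * a + b) // 3 for a, b in zip(cmax, cmin)),
           tuple((a + 2 * b) // 3 for a, b in zip(cmax, cmin))]
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    indices = 0
    for i, t in enumerate(texels):
        best = min(range(4), key=lambda p: dist(pal[p], t))
        indices |= best << (2 * i)          # 16 texels x 2 bits = 32 bits
    return rgb565(*cmax), rgb565(*cmin), indices
```

The output matches the block layout from the figure slide: two 16-bit endpoint colors plus a 32-bit index word, 8 bytes per 4x4 block.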

Putting it Together: the CPU entropy decode needs to work on large blocks, dozens of tasks per frame, while the GPU kernels desire larger sets; launching all pages as a single kernel is best for utilization. Parameters, like quantization level and final format, can vary per page, and the data must get to the GPU efficiently.

Solution, CPU-side: the CPU task handles entropy decode directly into locked system memory and generates tasklets for the GPU: small job headers describing the offset and parameters for a single CTA task.

Solution, GPU-side: tasks are broken into two natural kernels, the frequency transform and DXT compression. Kernels read one header per CTA to guide the work: the offset to the input / result, the quantization table to use, and the compression format (diffuse or normal/specular).

One more thing: CPU-to-GPU bandwidth can be an issue. Solution 1: stage copies to happen in parallel with computation; this forces an extra frame of latency. Solution 2: utilize zero copy and have the frequency transform read from CPU memory; this allows further bandwidth optimization.

Split Entropy Decode: Huffman coding for DCT typically truncates the coefficient matrix. The CPU decode can prepend a length and pack multiple matrices together; the GPU fetches a block of data and uses the matrix lengths to decode the run-length packing. This can easily save 50% of the bandwidth.
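The packing scheme can be sketched as a length-prefixed stream: the CPU emits only the non-truncated coefficients behind a length, and the consumer zero-pads each block back to its full 64 entries. The exact format here is an assumption for illustration:

```python
# Hedged sketch of the split-decode packing: length-prefixed coefficient
# runs on the CPU side, zero-padded back to full blocks on the GPU side.

def pack_blocks(blocks):
    """blocks: lists of leading (non-truncated) coefficients per 8x8 block."""
    stream = []
    for coeffs in blocks:
        stream.append(len(coeffs))   # prepend the length
        stream.extend(coeffs)        # then only the kept coefficients
    return stream

def unpack_blocks(stream, block_size=64):
    out, pos = [], 0
    while pos < len(stream):
        n = stream[pos]; pos += 1
        coeffs = stream[pos:pos + n]; pos += n
        out.append(coeffs + [0] * (block_size - n))  # restore truncated zeros
    return out
```

Since DCT blocks are mostly trailing zeros after quantization, shipping only the prefixes is where the quoted ~50% bandwidth saving comes from.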

Run Length Decode: fetch data from system memory into shared memory; read the first element as a length; if threadIdx < 64 and threadIdx < length, copy; advance the pointer; refill shared memory if below the low-water mark; repeat for all blocks.

Results: CPU cost drops from 20+ ms (Core i7) down to ~4 ms (Core i7), and the GPU costs < 3 ms (GTS 450). The result is better image quality and/or a better frame rate, particularly on moderate (2-4 core) CPUs.

Conclusions: virtual texturing offers a good method for handling large datasets, and it can benefit from GPU offload; the GPU can provide a 4x improvement, resulting in better image quality.

Thanks: id Software; NVIDIA Devtech.