Image Processing Tricks in OpenGL. Simon Green NVIDIA Corporation

Similar documents
The GPGPU Programming Model

Enhancing Traditional Rasterization Graphics with Ray Tracing. March 2015

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size

Graphics Processing Unit Architecture (GPU Arch)

The Rasterization Pipeline

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Evaluating MMX Technology Using DSP and Multimedia Applications

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

GPU-Accelerated Iterated Function Systems. Simon Green, NVIDIA Corporation

Order Independent Transparency with Dual Depth Peeling. Louis Bavoil, Kevin Myers

GPU Memory Model. Adapted from:

Using Virtual Texturing to Handle Massive Texture Data

MXwendler Fragment Shader Development Reference Version 1.0

Cloth Simulation on the GPU. Cyril Zeller NVIDIA Corporation

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Render-To-Texture Caching. D. Sim Dietrich Jr.

Direct Rendering of Trimmed NURBS Surfaces

Evolution of GPUs Chris Seitz

HD (in) Processing. Andrés Colubri Design Media Arts dept., UCLA Broad Art Center, Suite 2275 Los Angeles, CA

DiFi: Distance Fields - Fast Computation Using Graphics Hardware

GPU Accelerating Speeded-Up Robust Features Timothy B. Terriberry, Lindley M. French, and John Helmsen

An introduction to JPEG compression using MATLAB

Spring 2009 Prof. Hyesoon Kim

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

The Rasterization Pipeline

GPGPU: Parallel Reduction and Scan

Enhancing Traditional Rasterization Graphics with Ray Tracing. October 2015

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

GUERRILLA DEVELOP CONFERENCE JULY 07 BRIGHTON

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Soft Particles. Tristan Lorach

Filtering theory: Battling Aliasing with Antialiasing. Department of Computer Engineering Chalmers University of Technology

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Programmable Graphics Hardware

Summed-Area Tables. And Their Application to Dynamic Glossy Environment Reflections Thorsten Scheuermann 3D Application Research Group

Working with Metal Overview

Real-Time Universal Capture Facial Animation with GPU Skin Rendering

Mattan Erez. The University of Texas at Austin

Spring 2011 Prof. Hyesoon Kim

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

x ~ Hemispheric Lighting

2D rendering takes a photo of the 2D scene with a virtual camera that selects an axis aligned rectangle from the scene. The photograph is placed into

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

CS 381 Computer Graphics, Fall 2008 Midterm Exam Solutions. The Midterm Exam was given in class on Thursday, October 23, 2008.

The Graphics Pipeline

UberFlow: A GPU-Based Particle Engine

THE REMOTE-SENSING IMAGE FUSION BASED ON GPU

Rationale for Non-Programmable Additions to OpenGL 2.0

The Application Stage. The Game Loop, Resource Management and Renderer Design

Lecture 8 JPEG Compression (Part 3)

Computer Graphics (CS 543) Lecture 10: Normal Maps, Parametrization, Tone Mapping

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Lossy Coding 2 JPEG. Perceptual Image Coding. Discrete Cosine Transform JPEG. CS559 Lecture 9 JPEG, Raster Algorithms

Programmable GPUs. Real Time Graphics 11/13/2013. Nalu 2004 (NVIDIA Corporation) GeForce 6. Virtua Fighter 1995 (SEGA Corporation) NV1

COMP 4801 Final Year Project. Ray Tracing for Computer Graphics. Final Project Report FYP Runjing Liu. Advised by. Dr. L.Y.

GPGPU. Peter Laurens 1st-year PhD Student, NSC

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

Computer and Machine Vision

Deferred Rendering Due: Wednesday November 15 at 10pm

Shadows & Decals: D3D10 techniques from Frostbite. Johan Andersson Daniel Johansson

Graphics Hardware. Instructor Stephen J. Guy

Computer Graphics. Texture Filtering & Sampling Theory. Hendrik Lensch. Computer Graphics WS07/08 Texturing

GpuPy: Accelerating NumPy With a GPU

GPU Image Processing. SIGGRAPH 2004 Frank Jargstorff

CSCI 420 Computer Graphics Lecture 14. Rasterization. Scan Conversion Antialiasing [Angel Ch. 6] Jernej Barbic University of Southern California

Portland State University ECE 588/688. Graphics Processors

Filtering theory: Battling Aliasing with Antialiasing. Tomas Akenine-Möller Department of Computer Engineering Chalmers University of Technology

C P S C 314 S H A D E R S, O P E N G L, & J S RENDERING PIPELINE. Mikhail Bessmeltsev

DEFERRED RENDERING STEFAN MÜLLER ARISONA, ETH ZURICH SMA/

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane

Using Streaming SIMD Extensions in a Fast DCT Algorithm for MPEG Encoding

Rasterization. Rasterization (scan conversion) Digital Differential Analyzer (DDA) Rasterizing a line. Digital Differential Analyzer (DDA)

Fall CSCI 420: Computer Graphics. 7.1 Rasterization. Hao Li.

CMSC427 Advanced shading getting global illumination by local methods. Credit: slides Prof. Zwicker

JPEG decoding using end of block markers to concurrently partition channels on a GPU. Patrick Chieppe (u ) Supervisor: Dr.

Introduction ti to JPEG

Applications of Explicit Early-Z Z Culling. Jason Mitchell ATI Research

Render all data necessary into textures Process textures to calculate final image

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Sign up for crits! Announcments

Windowing System on a 3D Pipeline. February 2005

Fast Third-Order Texture Filtering

Module Introduction. Content 15 pages 2 questions. Learning Time 25 minutes

Game Graphics Programmers

Rendering Subdivision Surfaces Efficiently on the GPU

GPU Memory Model Overview

3D Rendering Pipeline

E.Order of Operations

Rendering Objects. Need to transform all geometry then

Interactive Summed-Area Table Generation for Glossy Environmental Reflections

COMP371 COMPUTER GRAPHICS

About Phoenix FD PLUGIN FOR 3DS MAX AND MAYA. SIMULATING AND RENDERING BOTH LIQUIDS AND FIRE/SMOKE. USED IN MOVIES, GAMES AND COMMERCIALS.

CS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside

Programming Graphics Hardware

Real Time Rendering of Complex Height Maps Walking an infinite realistic landscape By: Jeffrey Riaboy Written 9/7/03

Data-Parallel Algorithms on GPUs. Mark Harris NVIDIA Developer Technology

Moving Mobile Graphics Advanced Real-time Shadowing. Marius Bjørge ARM

Transcription:

Image Processing Tricks in OpenGL Simon Green NVIDIA Corporation

Overview Image Processing in Games Histograms Recursive filters JPEG Discrete Cosine Transform

Image Processing in Games Image processing is increasingly important in video games Games are becoming more like movies a large part of the final look is determined in post color correction, blurs, depth of field, motion blur Important for accelerating offline tools too pre-processing processing (lightmaps( lightmaps) texture compression

Image Histograms Image histograms give frequency of occurrence of each intensity level in image useful for image analysis, HDR tone mapping algorithms OpenGL imaging subset has histogram functions but this is not widely supported Solution - calculate histograms using multiple passes and occlusion query

Histograms using Occlusion Query Render scene to texture For each bucket in histogram Begin occlusion query Draw quad with scene texture Use fragment program that discards fragments outside appropriate luminance range End occlusion query Get number of fragments that passed, store in histogram array Process histogram Requires n passes for n buckets

Histogram Fragment Program float4 main(in float4 wpos : WPOS, uniform samplerrect tex, uniform float min, uniform float max, uniform float3 channels ) : COLOR { // fetch color from texture float4 c = texrect(tex, wpos.xy); // calculate luminance or select channel float lum = dot(channels, c.rgb); // discard pixel if not inside range if (lum( < min lum >= max) discard; } return c;

Histogram Demo

Performance Depends on image size, number of passes 40fps for 32 bucket histogram on 512 x 512 image, GeForce 5900 For large histograms, may be faster to readback and compute on CPU

Recursive (IIR) Image Filters Most existing blur implementations use standard convolution filter output is only function of surrounding pixels If we scan through the image, can we make use of the previous filter outputs? Output of a recursive filter is function of previous inputs and previous outputs feedback! Simple recursive filter y[n] = a*y[n-1] + (1-a)*x[n]

Recursive Image Filters Require fewer samples for given frequency response Can produce arbitrarily wide blurs for constant cost this is why Gaussian blurs in Photoshop take same amount of time regardless of width But difficult to analyze and control like a control system, trying to follow its input mathematics is very complicated!

FIR vs. IIR Impulse response of filter is how it responds to unit impulse (discrete delta function): also known as point spread function Finite Impulse Response (FIR) response to impulse stops outside filter footprint stable Infinite Impulse Response (IIR) response to impulse can go on forever can be unstable widely used in digital signal processing

Review: Building Summed Area Tables using Graphics Hardware Presented at GDC 2003 Each texel in SAT is the sum of all texels below and to the left of it Implemented by rendering lines using render-to to-texturetexture Sum columns first, and then rows Each row or column is rendered as a line primitive Fragment program adds value of current texel with texel to the left or below

Building Summed Area Table 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 4 8 12 16 3 6 9 12 2 4 6 8 1 2 3 4 Original image Sum columns Sum rows For n x m image, requires rendering 2 x n x m pixels, each of which performs two texture lookups

Problems With This Technique Texturing from same buffer you are rendering to can produce undefined results e.g. Texture cache changed from NV3x to NV4x broke SAT demo Don t t rely on undefined behaviour! Line primitives do not make very efficient use of rasterizer or shader hardware Most modern graphics hardware processes groups of pixels in parallel

Solutions Use two buffers, ping-pong pong between them Copy changes back from destination buffer to source each pass Buffer switching is fast with framebuffer object extension Can also unroll loop so that we render 2 x n quads instead of lines Unroll fragment program so that it does computations for two fragments Use per-vertex color to determine if we re rendering odd or even row/column

Implementing IIR Image Filters Can implement recursive (IIR) image filters using same technique as summed area table Scan through image, rendering line or quad primitives Fragment program reads from previous output buffer and previous input buffer, writes to third buffer Process rows, then columns

Simple IIR Filter float4 main(vf30 In, uniform samplerrect y, // out uniform samplerrect x, // in uniform float4 delta, uniform float4 a, // filter coefficients ) : COLOR { float2 n = In.WPOS.xy); // current float2 nm1 = n + delta.xy; ; // previous } return lerp(texrect(y,, nm1), texrect(x,, n), a[0]);

Simple IIR Filter (Before)

Simple IIR Filter (After)

Symmetric Recursive Filtering Recursive filters are directional Causes phase shift of data Not a problem for time series (e.g. audio), but very obvious with images Can combine multiple recursive filters to construct zero-phase shift filter Run filter in positive direction (left to right) first, and then in negative direction (right to left) Phase shifts cancel out

Original Image

Result after Filter in Positive X & Y

Result after Filter in Negative X & Y

Resonant Image Filters Second order IIR filters can produce more interesting effects: y[n] = b0*x[n] + b1*x[n-1] 1] + b2*x[n-2] 2] a1*y[n-1] 1] a2*y[n-2] 2] Close model of analog electronic filters in real world (resistor / capacitor) Act like damped oscillators Can produce interesting non- photorealistic looks in image domain

Second Order IIR Filter float4 main(vf30 In, uniform samplerrect y, // out uniform samplerrect x, // in uniform float4 delta, uniform float4 a, // filter coefficients uniform float4 b ) : COLOR { float2 n = In.WPOS.xy); // current float2 nm1 = n + delta.xy; ; // previous float2 nm2 = n + delta.zw; } // second order IIR return b[0]*texrect(x texrect(x,, n) + b[1]*texrect(x texrect(x,, nm1) + b[2]*texrect(x texrect(x,, nm2) - a[1]*texrect(y texrect(y,, nm1) - a[2]* texrect(y,, nm2);

Resonant Image Filters

Resonant Image Filters

Resonant Image Filters

Discrete Cosine Transform DCT is similar to discrete Fourier transform Transforms image from spatial domain to frequency domain (and back) Used in JPEG and MPEG compression

DCT Basis Images

Performing The DCT in Shader Shader implementation based on work of the Independent JPEG Group monochrome (currently) floating point Could be used as part of a GPU- accelerated compressor/decompressor File decoding, Huffman compression would still need to be done on CPU Game applications None!

DCT Operation DCT used in JPEG operates on 8x8 pixel blocks Trade-off between 2D DCT is separable into 1D DCT on rows, followed by 1D DCT on columns Arai, Agui,, and Nakajima's algorithm 5 multiplies and 29 adds for 8 pixels Other multiplies are simple scales of output values

Partitioning the DCT Problem: 1D DCT is a function of 8 inputs, produces 8 outputs Shader likes n inputs, 1 output per pixel don t t want to duplicate effort across pixels Solution: Render quad 1/8 th width or height Shader reads 8 neighboring texels Writes 8 outputs to RGBA components of two render targets using MRT Data is unpacked on subsequent passes

Partitioning the DCT (Rows) n n/8 n n inputs outputs 0 1 2 3 4 5 6 7 0 1 2 3 0 1 2 3 shader

FDCT Shader Code // based on IJG jfdctflt.c void DCT(float d[8], out float4 output0, out float4 output1) { float tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7; float tmp10, tmp11, tmp12, tmp13; float z1, z2, z3, z4, z5, z11, z13; tmp0 = d[0] + d[7]; tmp7 = d[0] - d[7]; tmp1 = d[1] + d[6]; tmp6 = d[1] - d[6]; tmp2 = d[2] + d[5]; tmp5 = d[2] - d[5]; tmp3 = d[3] + d[4]; tmp4 = d[3] - d[4]; /* Even part */ tmp10 = tmp0 + tmp3; /* phase 2 */ tmp13 = tmp0 - tmp3; tmp11 = tmp1 + tmp2; tmp12 = tmp1 - tmp2; /* Odd part */ tmp10 = tmp4 + tmp5; /* phase 2 */ tmp11 = tmp5 + tmp6; tmp12 = tmp6 + tmp7; /* The rotator is modified from fig 4-84 8 to avoid extra negations. */ z5 = (tmp10 - tmp12) * 0.382683433; /* c6 */ z2 = 0.541196100 * tmp10 + z5; /* c2-c6 c6 */ z4 = 1.306562965 * tmp12 + z5; /* c2+c6 */ z3 = tmp11 * 0.707106781; /* c4 */ z11 = tmp7 + z3; /* phase 5 */ z13 = tmp7 - z3; output1[0] = z13 + z2; /* phase 6 */ output1[1] = z13 - z2; output1[2] = z11 + z4; output1[3] = z11 - z4; } output0[0] = tmp10 + tmp11; /* phase 3 */ output0[1] = tmp10 - tmp11; z1 = (tmp12 + tmp13) * 0.707106781; /* c4 */ output0[2] = tmp13 + z1; /* phase 5 */ output0[3] = tmp13 - z1;

Unpacking Code float4 DCT_unpack_rows_PS(float2 texcoord : TEXCOORD0, uniform samplerrect image, uniform samplerrect image2 ) : COLOR { float2 uv = texcoord * float2(1.0/8.0, 1.0); float4 c = texrect(image, uv); float4 c2 = texrect(image2, uv); } // rearrange data into correct order // x y z w // c 0 4 2 6 // c2 5 3 1 7 int i = frac(texcoord.x/8.0) * 8.0; float4 sel0 = (i == float4(0, 4, 2, 6)); float4 sel1 = (i == float4(5, 3, 1, 7)); return dot(c, sel0) + dot(c2, sel1);

Original Image

After FDCT (DCT coefficients)

After IDCT

Performance Around 160fps for FDCT followed by IDCT on 512 x 512 monochrome image on GeForce 6800 Ultra Still a lot of room for optimization make better use of vector math could process two channels simultaneously (4 MRTs) JPEGs are usually stored as luminance and 2 chrominance channels Chroma is at lower resolution Could also do resampling and color space conversion on GPU

Questions?

References Infinite Impulse Response Filters on Wikipedia The JPEG Still Picture Compression Standard,, Wallace G, Communications of the ACM Volume 34, Issue 4 Discrete Cosine Transform on Wikipedia