Overview
- Videos are everywhere, but can take up large amounts of resources: disk space, memory, network bandwidth
- Exploit redundancy to reduce file size: spatial and temporal
General lossless compression
- Huffman coding: shorter bit sequences for common data
- Lempel-Ziv: short bit sequences for previously seen strings
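As a toy illustration of Huffman's idea (shorter codes for more frequent symbols), the sketch below builds the tree and reports only the resulting code lengths, not the codes themselves; the function name is invented for this example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Build a Huffman tree over symbol frequencies and return the
    code length (in bits) assigned to each symbol."""
    freq = Counter(data)
    # Heap entries: (subtree frequency, tiebreak id, {symbol: depth so far})
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code inside them
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaaaaaabbbccd")
# the most common symbol ('a') gets the shortest code
```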
Transform coding
- Perform some transformation on the data
  - Does not reduce data size; usually theoretically lossless
  - Concentrates information in a small(er) number of data points
- Quantize the data (lossy)
  - Most data points become smaller numbers
- Losslessly compress the data stream
  - The typical range of data is smaller, so fewer bits are required to store the common case
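The quantization step above can be sketched as follows (toy coefficient values, not from any specific codec): dividing by a step size and rounding collapses small coefficients to zero and shrinks the rest into a narrow integer range that an entropy coder handles well.

```python
def quantize(coeffs, step):
    """Lossy step: divide by a step size and round. Small coefficients
    collapse to 0, and the survivors become small integers, which suits
    a lossless entropy coder."""
    return [round(c / step) for c in coeffs]

def dequantize(qcoeffs, step):
    """Approximate reconstruction; the rounding error is the loss."""
    return [q * step for q in qcoeffs]

coeffs = [812.0, -105.3, 6.2, -3.9, 1.1, -0.4]   # hypothetical transform output
q = quantize(coeffs, step=10)                    # [81, -11, 1, 0, 0, 0]
```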
Discrete Cosine Transform (DCT)
- Traditional lossy compression
- Converts a function of time to a function of frequency: a weighted sum of cosine functions
- Information from the original signal can be completely reconstructed from the generated weights
- Fast FFT-style algorithms: O(N log N) vs. O(N^2) for the naive transform
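A minimal, naive O(N^2) sketch of the DCT-II and its inverse (unnormalized variant for clarity; a real codec would use a fast FFT-style algorithm):

```python
import math

def dct(x):
    """Naive DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def idct(X):
    """Matching inverse (scaled DCT-III): exact reconstruction."""
    N = len(X)
    return [X[0] / N + 2.0 / N * sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k)
                                     for k in range(1, N))
            for n in range(N)]

x = [8.0, 6.0, 7.0, 5.0]
# round trip reconstructs the signal up to floating-point error
assert all(abs(a - b) < 1e-9 for a, b in zip(idct(dct(x)), x))
```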
2D DCT
- Treat each row of the signal as a 1D signal; perform a 1D transform
- Treat each column of the transformed signal as a 1D signal; perform another 1D transform
- Separable transformation: 2nk vs. nk^2 operations
- 3D extension?
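The separability above can be sketched by reusing a 1D DCT on the rows and then on the columns (naive, unnormalized DCT-II as before; `dct2d` is an invented helper name):

```python
import math

def dct1d(x):
    """Naive unnormalized DCT-II of a 1D sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def dct2d(img):
    """Separable 2D DCT: 1D transform on every row, then on every column."""
    rows = [dct1d(row) for row in img]
    cols = [dct1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]   # transpose back

img = [[1.0, 2.0], [3.0, 4.0]]
out = dct2d(img)
# out[0][0] is the DC term: with this unnormalized DCT-II it equals
# the sum of all pixels (10.0 here)
```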
Transform coding
- The DCT itself does not perform any compression
- Images concentrate most of their information in low-frequency components
- High-frequency components can be stored with less precision (human visual system)
- High-frequency components often round to zero, and the loss of information is not noticeable
Global transform
- The DCT acts on an entire signal, so perform it on image blocks
  - One value per frequency for an entire block
  - Block artefacts: image discontinuities
- Sharp edges dividing otherwise relatively low-frequency areas
  - High-frequency components localized to a small number of pixels
  - The DCT is less effective at representing these compactly
Discrete Wavelet Transform (DWT)
- Decomposition into two signals, each with half the resolution of the input
- Approximation signal: a low-resolution version of the original
  - Contains only low frequencies
- Detail signal: the information lost by reducing the resolution
  - Contains only high frequencies
Discrete Wavelet Transform (DWT)
- The approximation signal is recursively transformed
  - The image is entirely converted to detail signals of various resolutions
- The final result is effectively a sum of scaled and translated versions of a wavelet (a small portion of a wave)
  - Wavelets have location; waves have phase
  - Avoids undershoot and ringing
- The 2D DWT is often separable (though this depends on the wavelet): square decomposition
The Haar Wavelet
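One level of the Haar decomposition can be sketched as pairwise averages (the half-resolution approximation) and pairwise differences (the detail); the helper names are invented for illustration.

```python
def haar_step(signal):
    """One level of the Haar DWT: pairwise averages (approximation)
    and pairwise differences (detail), each at half resolution."""
    approx = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    detail = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    return approx, detail

def haar_inverse(approx, detail):
    """Exact reconstruction: a = s + d, b = s - d for each pair."""
    out = []
    for s, d in zip(approx, detail):
        out += [s + d, s - d]
    return out

a, d = haar_step([9.0, 7.0, 3.0, 5.0])   # a = [8.0, 4.0], d = [1.0, -1.0]
assert haar_inverse(a, d) == [9.0, 7.0, 3.0, 5.0]
```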
More complicated wavelets
Locality
- The detail signal is not transformed
- Despite being high frequency, discontinuities will remain localized
- Can be less effective for periodic signals; better for images
Motion compensation
- Calculate the motion direction of parts of an image
  - Temporal coherence: similarity between neighboring video frames
- Global: describes the motion of the camera
- Local: describes the motion of small objects (within a block of an image)
- Motion compensation => a prediction of the next frame
  - The residue (difference from the prediction) is stored
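A toy sketch of local motion estimation and residue computation (exhaustive sum-of-absolute-differences search; the function names and the tiny frames are invented for illustration, and real codecs use much smarter searches):

```python
def best_motion_vector(ref, block, bx, by, search=1):
    """Exhaustive block matching: find the shift (dx, dy) into the
    reference frame that minimizes the sum of absolute differences."""
    n = len(block)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # skip candidate positions that fall outside the reference frame
            if not (0 <= by + dy <= len(ref) - n and 0 <= bx + dx <= len(ref[0]) - n):
                continue
            sad = sum(abs(block[y][x] - ref[by + dy + y][bx + dx + x])
                      for y in range(n) for x in range(n))
            if best is None or sad < best[0]:
                best = (sad, dx, dy)
    return best[1], best[2]

def residue(ref, block, bx, by, dx, dy):
    """Difference between the block and its motion-compensated
    prediction; this is what actually gets transformed and stored."""
    n = len(block)
    return [[block[y][x] - ref[by + dy + y][bx + dx + x] for x in range(n)]
            for y in range(n)]

ref = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
block = [[7, 8], [11, 12]]   # matches ref shifted one pixel to the right
dx, dy = best_motion_vector(ref, block, bx=1, by=1)
```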
Accelerating Wavelet-Based Video Coding on Graphics Hardware using CUDA
Wladimir J. van der Laan, Jos B.T.M. Roerdink, Andrei C. Jalba
Dirac Wavelet Video Codec (DWVC)
- Video compression format from BBC Research
- Open-source, royalty-free alternative to H.264; roughly equivalent quality
- Dirac research: reference implementation
- Schrödinger: high-performance, heavily optimized implementation
  - Good basis for performance comparison
DWVC decoding
- Stream data
  - Intra frames: self-contained images
  - Inter frames: differences with respect to one or two reference frames
- Arithmetic decoding: lossless; extracts parameters, vectors, and coefficients from the bitstream
  - Reverses an entropy coder, which represents common values with shorter bit sequences
  - Little inherent parallelism, so it is handled by the CPU
- Motion compensation
  - The residue (difference from the prediction) is stored as wavelet coefficients
CUDA implementation
- Use CUDA to avoid mapping the decoding process onto the rendering pipeline
- Lifting scheme: less arithmetic, in-place
- Frame arithmetic: 16- vs. 32-bit?
- Sub-pixel precision: bicubic interpolation of the reference frame
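Sub-pixel interpolation can be illustrated with a Catmull-Rom cubic kernel, a common bicubic choice; this is a generic sketch, not necessarily the exact filter taps the Dirac codec specifies.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Cubic interpolation between samples p1 and p2 (Catmull-Rom
    spline), evaluated at fractional offset t in [0, 1]. A 2D bicubic
    lookup applies this kernel along rows, then along columns."""
    return 0.5 * ((2 * p1)
                  + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t * t
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)

samples = [10.0, 20.0, 40.0, 30.0]
# t = 0 reproduces the sample p1 exactly; t = 0.5 is the half-pixel value
assert catmull_rom(*samples, 0.0) == 20.0
half = catmull_rom(*samples, 0.5)
```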
Separable transformation for wavelet lifting
- Decompose the 2D operation into two 1D operations
Horizontal pass
- Coalesced read of part of a row
  - Duplicate border elements for boundary conditions
- Shared memory: in-place lifting
  - Syncthreads after each step in the transform
- Coalesced write back to global memory
- Coefficients reorganized based on JPEG 2000 cache-efficient wavelet lifting
Vertical pass
- Substituting rows for columns => poor coalescing
- Each block processes multiple columns: a slab
  - Each row in a slab can be read with coalescing
- Shared memory: transform on columns
- Sliding window: not all columns can fit in shared memory
Motion compensation: block placement
- Traditional: divide the image into equally sized, disjoint blocks
  - Strong discontinuities between neighboring blocks
  - Poor prediction on block edges
- Overlapped Block Motion Compensation (OBMC)
  - Overlaps neighboring blocks
  - Blends them together in the shared area
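The blending in the shared area can be sketched as a 1D linear ramp whose weights sum to 1 (illustrative weights and helper name; the codec's actual OBMC weight function may differ):

```python
def blend_overlap(left_pred, right_pred):
    """Blend the predictions of two overlapping blocks across their
    shared area: a linear ramp, so the weights always sum to 1 and the
    block boundary has no sharp discontinuity."""
    n = len(left_pred)
    out = []
    for i in range(n):
        w_right = (i + 1) / (n + 1)   # weight shifts toward the right block
        out.append((1 - w_right) * left_pred[i] + w_right * right_pred[i])
    return out

# Two blocks predicting different constant values in the overlap:
blended = blend_overlap([10.0] * 3, [20.0] * 3)
# a smooth ramp between 10 and 20 instead of a hard edge
```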
Reference frame options
- The previous frame
- The previous and next frames (blended together with some weights), for fades
- A different frame several frames back, if it is a better match
Overlapped blocks
- Each pixel is part of up to four motion compensation blocks per frame
- Naive implementation: equally sized CUDA blocks
  - Complicated flow control: neighboring pixels access different motion compensation blocks
Solution: divide the image into regions
- Based on the number and orientation of overlapping blocks
  - Center: 1 block
  - Edges: 2 blocks (horizontal or vertical overlap), linear blend
  - Corners: 4 blocks, bilinear blend
- All pixels in a region execute the same code
- Each region is processed by one CUDA block: no divergent branching within a block
- Texture memory is faster than constant memory here: each thread potentially accesses a different location
Results
- Dual-core AMD Opteron 280 (single-threaded) vs. Nvidia GeForce GTX 280, CUDA 2.2
- GPU times do not require readback (video is displayed through OpenGL textures)
- 5.4x overall speedup for the entire decode process
- 13x speedup for GPU operations (arithmetic decoding excluded)
- 1920x1080 (1080p) displayed at 56.4 fps
  - 25 fps needed for movie playback
  - 10.5 fps for the CPU reference
Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting
Christian Tenllado, Javier Setoain, Manuel Prieto, Luis Piñuel, Francisco Tirado
Focus on the DWT
- Has other image processing / computer graphics applications: multiresolution analysis
- Primary methods: the filter bank scheme (FBS) and the lifting scheme (LS)
Filter bank
- Given a signal A:
  - Run a low-pass filter (convolution) on A to get the low-frequency approximation (~blur)
  - Run the corresponding high-pass filter on A to get the high-frequency details
  - Downsample both by two (since we now have twice as much information as necessary)
  - Recurse on the approximation
- A direct translation of the definition of the wavelet transform
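The steps above can be sketched with the Haar filter pair (a minimal sketch; real codecs use longer filters such as Daubechies, and the `analysis` helper name is invented):

```python
def analysis(signal, lo, hi):
    """One filter-bank stage: convolve with a low-pass and a high-pass
    filter, then keep every second output sample (downsample by 2)."""
    def conv(taps):
        return [sum(t * signal[i + j] for j, t in enumerate(taps))
                for i in range(len(signal) - len(taps) + 1)]
    return conv(lo)[::2], conv(hi)[::2]

# Haar pair: low-pass is a 2-tap average (~blur), high-pass a difference
approx, detail = analysis([9.0, 7.0, 3.0, 5.0], lo=[0.5, 0.5], hi=[0.5, -0.5])
# approx = [8.0, 4.0], detail = [1.0, -1.0]
```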
Lifting scheme
- Combines the high-pass and low-pass filters
- Any FBS wavelet can be factorized into several LS steps via the polyphase matrix representation
- Split the signal into odd/even values (the lazy wavelet transform)
- Predict
- Update
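For the Haar wavelet, the split/predict/update steps look like this (integer-only variant, so the transform is exactly invertible; the helper names are invented for illustration):

```python
def haar_lift(signal):
    """Haar via lifting: split into even/odd (lazy wavelet transform),
    then predict (detail = odd - even) and update (approx = even + detail // 2).
    Integer-only, hence exactly invertible => lossless."""
    even, odd = list(signal[0::2]), list(signal[1::2])
    detail = [o - e for o, e in zip(odd, even)]          # predict step
    approx = [e + d // 2 for e, d in zip(even, detail)]  # update step
    return approx, detail

def haar_unlift(approx, detail):
    """Invert by running the same steps backwards with signs flipped."""
    even = [s - d // 2 for s, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out

x = [9, 7, 3, 5, 100, 1]
assert haar_unlift(*haar_lift(x)) == x   # perfect integer reconstruction
```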
LS advantages
- Simple to invert: run the steps in the opposite direction (no deconvolution)
- A method for producing wavelet transforms
  - Control over the actual operations that are executed
  - Can use integer operations => lossless compression
- Easy to generalize: the "+" must be invertible but doesn't have to be addition
- Tends to be more efficient w.r.t. the amount of hardware or power consumption for embedded systems
FBS vs. LS speed
- CPU: LS up to twice the speed of FBS
  - Performs about half as many computations, though actual gains are often smaller than theoretical
  - In-place transform
  - LS is the default way to implement the wavelet transform; seen as most efficient
- GPU: FBS is actually faster
  - Fewer synchronization barriers
Implementation
- OpenGL + Cg
- Layout: 2x2 blocks stored in an RGBA texel; allows the horizontal and vertical algorithms to be designed symmetrically
- Filter bank: a synchronization barrier between the horizontal and vertical filters
- Lifting scheme: several loops performing simple vector operations on each data stream
  - Every LS step is performed by a different kernel: many synchronization barriers
Results
- Execution times scale linearly with problem size
- The ratio of LS time to FBS time tends to a constant as size grows
- Speedups from the Nvidia FX 5950 Ultra (2003) to the 7800 GTX (2005): 4x for FBS, 2.2x for LS
Results
- The key performance factor is the number of rendering passes and synchronization barriers
  - FBS doesn't require a pipeline flush, allowing better parallelization
- LS: removing synchronization barriers (incorrect output, but a good performance estimate) => 1.4x speedup
- GPU: 1.2-3.4x speedup over the CPU implementation, excluding data transfer
- Transforms a 4-megapixel image in 9.12 ms (FBS) and 17.9 ms (LS) with the Daubechies-4 wavelet
  - Slower times for more complicated wavelets
Future improvements
- The LS/FBS time ratio grows as the number of shader processors increases: future GPUs will progressively favor FBS
- Waiting for better CPU/GPU integration
- Suggest fusing consecutive kernels: increased complexity, but faster
Summary
- The GPU allows a severalfold speedup over the CPU for decompression with modern codecs
- May not seem dramatic, but helps cross the barrier of movie playback frame rates
- Allows more types of compression algorithms to become feasible
- Implementation methods that are best for the CPU may not be best for the GPU