Overview
- Videos are everywhere, but can take up large amounts of resources: disk space, memory, network bandwidth
- Exploit redundancy to reduce file size: spatial and temporal
General lossless compression
- Huffman coding: shorter bit sequences for common data
- Lempel-Ziv: short bit sequences for previously seen strings
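As a toy illustration of Huffman's idea (shorter codes for more frequent symbols), the sketch below builds the tree and reports only the resulting code lengths, not the codes themselves; the function name is invented for this example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Build a Huffman tree over symbol frequencies and return the
    code length (in bits) assigned to each symbol."""
    freq = Counter(data)
    # Heap entries: (subtree frequency, tiebreak id, {symbol: depth so far})
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code inside them
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaaaaaabbbccd")
# the most common symbol ('a') gets the shortest code
```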
Transform coding
- Perform some transformation on the data
  - Does not reduce data size; usually theoretically lossless
  - Concentrates information in a small(er) number of data points
- Quantize the data (lossy)
  - Most data points become smaller numbers
- Losslessly compress the data stream
  - The typical range of data is smaller, so fewer bits are required to store the common case
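The quantization step above can be sketched as follows (toy coefficient values, not from any specific codec): dividing by a step size and rounding collapses small coefficients to zero and shrinks the rest into a narrow integer range that an entropy coder handles well.

```python
def quantize(coeffs, step):
    """Lossy step: divide by a step size and round. Small coefficients
    collapse to 0, and the survivors become small integers, which suits
    a lossless entropy coder."""
    return [round(c / step) for c in coeffs]

def dequantize(qcoeffs, step):
    """Approximate reconstruction; the rounding error is the loss."""
    return [q * step for q in qcoeffs]

coeffs = [812.0, -105.3, 6.2, -3.9, 1.1, -0.4]   # hypothetical transform output
q = quantize(coeffs, step=10)                    # [81, -11, 1, 0, 0, 0]
```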
Discrete Cosine Transform (DCT)
- Traditional lossy compression
- Converts a function of time to a function of frequency: a weighted sum of cosine functions
- Information from the original signal can be completely reconstructed from the generated weights
- Fast FFT-style algorithms: O(N log N) vs. O(N^2) for the naive transform
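A minimal, naive O(N^2) sketch of the DCT-II and its inverse (unnormalized variant for clarity; a real codec would use a fast FFT-style algorithm):

```python
import math

def dct(x):
    """Naive DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def idct(X):
    """Matching inverse (scaled DCT-III): exact reconstruction."""
    N = len(X)
    return [X[0] / N + 2.0 / N * sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k)
                                     for k in range(1, N))
            for n in range(N)]

x = [8.0, 6.0, 7.0, 5.0]
# round trip reconstructs the signal up to floating-point error
assert all(abs(a - b) < 1e-9 for a, b in zip(idct(dct(x)), x))
```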
2D DCT
- Treat each row of the signal as a 1D signal; perform a 1D transform
- Treat each column of the transformed signal as a 1D signal; perform another 1D transform
- Separable transformation: 2nk vs. nk^2 operations
- 3D extension?
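The separability above can be sketched by reusing a 1D DCT on the rows and then on the columns (naive, unnormalized DCT-II as before; `dct2d` is an invented helper name):

```python
import math

def dct1d(x):
    """Naive unnormalized DCT-II of a 1D sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def dct2d(img):
    """Separable 2D DCT: 1D transform on every row, then on every column."""
    rows = [dct1d(row) for row in img]
    cols = [dct1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]   # transpose back

img = [[1.0, 2.0], [3.0, 4.0]]
out = dct2d(img)
# out[0][0] is the DC term: with this unnormalized DCT-II it equals
# the sum of all pixels (10.0 here)
```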
Transform coding
- The DCT itself does not perform any compression
- Images concentrate most of their information in low-frequency components
- High-frequency components can be stored with less precision (human visual system)
- High-frequency components often round to zero, and the loss of information is not noticeable
Global transform
- The DCT acts on an entire signal, so perform it on image blocks
  - One value per frequency for an entire block
  - Block artefacts: image discontinuities
- Sharp edges dividing otherwise relatively low-frequency areas
  - High-frequency components localized to a small number of pixels
  - The DCT is less effective at representing these compactly
Discrete Wavelet Transform (DWT)
- Decomposition into two signals, each with half the resolution of the input
- Approximation signal: a low-resolution version of the original
  - Contains only low frequencies
- Detail signal: the information lost by reducing the resolution
  - Contains only high frequencies
Discrete Wavelet Transform (DWT)
- The approximation signal is recursively transformed
  - The image is entirely converted to detail signals of various resolutions
- The final result is effectively a sum of scaled and translated versions of a wavelet (a small portion of a wave)
  - Wavelets have location; waves have phase
  - Avoids undershoot and ringing
- The 2D DWT is often separable (though this depends on the wavelet): square decomposition
The Haar Wavelet
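One level of the Haar decomposition can be sketched as pairwise averages (the half-resolution approximation) and pairwise differences (the detail); the helper names are invented for illustration.

```python
def haar_step(signal):
    """One level of the Haar DWT: pairwise averages (approximation)
    and pairwise differences (detail), each at half resolution."""
    approx = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    detail = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    return approx, detail

def haar_inverse(approx, detail):
    """Exact reconstruction: a = s + d, b = s - d for each pair."""
    out = []
    for s, d in zip(approx, detail):
        out += [s + d, s - d]
    return out

a, d = haar_step([9.0, 7.0, 3.0, 5.0])   # a = [8.0, 4.0], d = [1.0, -1.0]
assert haar_inverse(a, d) == [9.0, 7.0, 3.0, 5.0]
```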
More complicated wavelets
Locality
- The detail signal is not transformed
- Despite being high frequency, discontinuities will remain localized
- Can be less effective for periodic signals; better for images
Motion compensation
- Calculate the motion direction of parts of an image
  - Temporal coherence: similarity between neighboring video frames
- Global: describes the motion of the camera
- Local: describes the motion of small objects (within a block of an image)
- Motion compensation => a prediction of the next frame
  - The residue (difference from the prediction) is stored
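A toy sketch of local motion estimation and residue computation (exhaustive sum-of-absolute-differences search; the function names and the tiny frames are invented for illustration, and real codecs use much smarter searches):

```python
def best_motion_vector(ref, block, bx, by, search=1):
    """Exhaustive block matching: find the shift (dx, dy) into the
    reference frame that minimizes the sum of absolute differences."""
    n = len(block)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # skip candidate positions that fall outside the reference frame
            if not (0 <= by + dy <= len(ref) - n and 0 <= bx + dx <= len(ref[0]) - n):
                continue
            sad = sum(abs(block[y][x] - ref[by + dy + y][bx + dx + x])
                      for y in range(n) for x in range(n))
            if best is None or sad < best[0]:
                best = (sad, dx, dy)
    return best[1], best[2]

def residue(ref, block, bx, by, dx, dy):
    """Difference between the block and its motion-compensated
    prediction; this is what actually gets transformed and stored."""
    n = len(block)
    return [[block[y][x] - ref[by + dy + y][bx + dx + x] for x in range(n)]
            for y in range(n)]

ref = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
block = [[7, 8], [11, 12]]   # matches ref shifted one pixel to the right
dx, dy = best_motion_vector(ref, block, bx=1, by=1)
```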
Accelerating Wavelet-Based Video Coding on Graphics Hardware using CUDA
Wladimir J. van der Laan, Jos B.T.M. Roerdink, Andrei C. Jalba
Dirac Wavelet Video Codec (DWVC)
- Video compression format from BBC Research
- Open-source, royalty-free alternative to H.264; roughly equivalent quality
- Dirac research: reference implementation
- Schrödinger: high-performance, heavily optimized implementation
  - Good basis for performance comparison
DWVC decoding
- Stream data
  - Intra frames: self-contained images
  - Inter frames: differences with respect to one or two reference frames
- Arithmetic decoding: lossless; extracts parameters, vectors, and coefficients from the bitstream
  - Reverses an entropy coder, which represents common values with shorter bit sequences
  - Little inherent parallelism, so it is handled by the CPU
- Motion compensation
  - The residue (difference from the prediction) is stored as wavelet coefficients
CUDA implementation
- Use CUDA to avoid mapping the decoding process onto the rendering pipeline
- Lifting scheme: less arithmetic, in-place
- Frame arithmetic: 16- vs. 32-bit?
- Sub-pixel precision: bicubic interpolation of the reference frame
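Sub-pixel interpolation can be illustrated with a Catmull-Rom cubic kernel, a common bicubic choice; this is a generic sketch, not necessarily the exact filter taps the Dirac codec specifies.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Cubic interpolation between samples p1 and p2 (Catmull-Rom
    spline), evaluated at fractional offset t in [0, 1]. A 2D bicubic
    lookup applies this kernel along rows, then along columns."""
    return 0.5 * ((2 * p1)
                  + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t * t
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)

samples = [10.0, 20.0, 40.0, 30.0]
# t = 0 reproduces the sample p1 exactly; t = 0.5 is the half-pixel value
assert catmull_rom(*samples, 0.0) == 20.0
half = catmull_rom(*samples, 0.5)
```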
Separable transformation for wavelet lifting
- Decompose the 2D operation into two 1D operations
Horizontal pass
- Coalesced read of part of a row
  - Duplicate border elements for boundary conditions
- Shared memory: in-place lifting
  - Syncthreads after each step in the transform
- Coalesced write back to global memory
- Coefficients reorganized based on JPEG 2000 cache-efficient wavelet lifting
Vertical pass
- Substituting rows for columns => poor coalescing
- Each block processes multiple columns: a slab
  - Each row in a slab can be read with coalescing
- Shared memory: transform on columns
- Sliding window: not all columns can fit in shared memory
Motion compensation: block placement
- Traditional: divide the image into equally sized, disjoint blocks
  - Strong discontinuities between neighboring blocks
  - Poor prediction on block edges
- Overlapped Block Motion Compensation (OBMC)
  - Overlaps neighboring blocks
  - Blends them together in the shared area
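The blending in the shared area can be sketched as a 1D linear ramp whose weights sum to 1 (illustrative weights and helper name; the codec's actual OBMC weight function may differ):

```python
def blend_overlap(left_pred, right_pred):
    """Blend the predictions of two overlapping blocks across their
    shared area: a linear ramp, so the weights always sum to 1 and the
    block boundary has no sharp discontinuity."""
    n = len(left_pred)
    out = []
    for i in range(n):
        w_right = (i + 1) / (n + 1)   # weight shifts toward the right block
        out.append((1 - w_right) * left_pred[i] + w_right * right_pred[i])
    return out

# Two blocks predicting different constant values in the overlap:
blended = blend_overlap([10.0] * 3, [20.0] * 3)
# a smooth ramp between 10 and 20 instead of a hard edge
```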
Reference frame options
- The previous frame
- The previous and next frames (blended together with some weights), for fades
- A different frame several frames back, if it is a better match
Overlapped blocks
- Each pixel is part of up to four motion compensation blocks per frame
- Naive implementation: equally sized CUDA blocks
  - Complicated flow control: neighboring pixels access different motion compensation blocks
Solution: divide the image into regions
- Based on the number and orientation of overlapping blocks
  - Center: 1 block
  - Edges: 2 blocks (horizontal or vertical overlap), linear blend
  - Corners: 4 blocks, bilinear blend
- All pixels in a region execute the same code
- Each region is processed by one CUDA block: no divergent branching within a block
- Texture memory is faster than constant memory here: each thread potentially accesses a different location
Results
- Dual-core AMD Opteron 280 (single-threaded) vs. Nvidia GeForce GTX 280, CUDA 2.2
- GPU times do not require readback (video is displayed through OpenGL textures)
- 5.4x overall speedup for the entire decode process
- 13x speedup for GPU operations (arithmetic decoding excluded)
- 1920x1080 (1080p) displayed at 56.4 fps
  - 25 fps needed for movie playback
  - 10.5 fps for the CPU reference
Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting
Christian Tenllado, Javier Setoain, Manuel Prieto, Luis Piñuel, Francisco Tirado
Focus on the DWT
- Has other image processing / computer graphics applications: multiresolution analysis
- Primary methods: the filter bank scheme (FBS) and the lifting scheme (LS)
Filter bank
- Given a signal A:
  - Run a low-pass filter (convolution) on A to get the low-frequency approximation (~blur)
  - Run the corresponding high-pass filter on A to get the high-frequency details
  - Downsample both by two (since we now have twice as much information as necessary)
  - Recurse on the approximation
- A direct translation of the definition of the wavelet transform
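The steps above can be sketched with the Haar filter pair (a minimal sketch; real codecs use longer filters such as Daubechies, and the `analysis` helper name is invented):

```python
def analysis(signal, lo, hi):
    """One filter-bank stage: convolve with a low-pass and a high-pass
    filter, then keep every second output sample (downsample by 2)."""
    def conv(taps):
        return [sum(t * signal[i + j] for j, t in enumerate(taps))
                for i in range(len(signal) - len(taps) + 1)]
    return conv(lo)[::2], conv(hi)[::2]

# Haar pair: low-pass is a 2-tap average (~blur), high-pass a difference
approx, detail = analysis([9.0, 7.0, 3.0, 5.0], lo=[0.5, 0.5], hi=[0.5, -0.5])
# approx = [8.0, 4.0], detail = [1.0, -1.0]
```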
Lifting scheme
- Combines the high-pass and low-pass filters
- Any FBS wavelet can be factorized into several LS steps via the polyphase matrix representation
- Split the signal into odd/even values (the lazy wavelet transform)
- Predict
- Update
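For the Haar wavelet, the split/predict/update steps look like this (integer-only variant, so the transform is exactly invertible; the helper names are invented for illustration):

```python
def haar_lift(signal):
    """Haar via lifting: split into even/odd (lazy wavelet transform),
    then predict (detail = odd - even) and update (approx = even + detail // 2).
    Integer-only, hence exactly invertible => lossless."""
    even, odd = list(signal[0::2]), list(signal[1::2])
    detail = [o - e for o, e in zip(odd, even)]          # predict step
    approx = [e + d // 2 for e, d in zip(even, detail)]  # update step
    return approx, detail

def haar_unlift(approx, detail):
    """Invert by running the same steps backwards with signs flipped."""
    even = [s - d // 2 for s, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out

x = [9, 7, 3, 5, 100, 1]
assert haar_unlift(*haar_lift(x)) == x   # perfect integer reconstruction
```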
LS advantages
- Simple to invert: run the steps in the opposite direction (no deconvolution)
- A method for producing wavelet transforms
  - Control over the actual operations that are executed
  - Can use integer operations => lossless compression
- Easy to generalize: the "+" must be invertible but doesn't have to be addition
- Tends to be more efficient w.r.t. the amount of hardware or power consumption for embedded systems
FBS vs. LS speed
- CPU: LS up to twice the speed of FBS
  - Performs about half as many computations, though actual gains are often smaller than theoretical
  - In-place transform
  - LS is the default way to implement the wavelet transform; seen as most efficient
- GPU: FBS is actually faster
  - Fewer synchronization barriers
Implementation
- OpenGL + Cg
- Layout: 2x2 blocks stored in an RGBA texel; allows the horizontal and vertical algorithms to be designed symmetrically
- Filter bank: a synchronization barrier between the horizontal and vertical filters
- Lifting scheme: several loops performing simple vector operations on each data stream
  - Every LS step is performed by a different kernel: many synchronization barriers
Results
- Execution times scale linearly with problem size
- The ratio of LS time to FBS time tends to a constant as size grows
- Speedups from the Nvidia FX 5950 Ultra (2003) to the 7800 GTX (2005): 4x for FBS, 2.2x for LS
Results
- The key performance factor is the number of rendering passes and synchronization barriers
  - FBS doesn't require a pipeline flush, allowing better parallelization
- LS: removing synchronization barriers (incorrect output, but a good performance estimate) => 1.4x speedup
- GPU: 1.2-3.4x speedup over the CPU implementation, excluding data transfer
- Transforms a 4-megapixel image in 9.12 ms (FBS) and 17.9 ms (LS) with the Daubechies-4 wavelet
  - Slower times for more complicated wavelets
Future improvements
- The LS/FBS time ratio grows as the number of shader processors increases: future GPUs will progressively favor FBS
- Waiting for better CPU/GPU integration
- Suggest fusing consecutive kernels: increased complexity, but faster
Summary
- The GPU allows a severalfold speedup over the CPU for decompression with modern codecs
- May not seem dramatic, but helps cross the barrier of movie playback frame rates
- Allows more types of compression algorithms to become feasible
- Implementation methods that are best for the CPU may not be best for the GPU