Overview
- Videos are everywhere, but they can take up large amounts of resources:
  - Disk space
  - Memory
  - Network bandwidth
- Exploit redundancy to reduce file size:
  - Spatial
  - Temporal

General lossless compression
- Huffman coding: shorter bit sequences for common data
- Lempel-Ziv: short bit sequences for previously seen strings
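
To make the Huffman idea concrete, here is a minimal sketch in Python. The function names `huffman_code` and `encode` are illustrative, not taken from the slides, and the exact bit patterns depend on how ties are broken; only the code lengths are meaningful.

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Return a dict mapping each symbol in data to a prefix-free bit string."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; rare symbols sink deeper.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(data, code):
    """Concatenate the bit strings of all symbols in data."""
    return "".join(code[s] for s in data)
```

For the input "aaaabbc" the most common symbol, "a", receives a one-bit code, so the encoded message is far shorter than a fixed 8 bits per character.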

Transform coding
- Perform some transformation on the data
  - Does not reduce data size; usually theoretically lossless
  - Concentrates information in a small(er) number of data points
- Quantize the data (lossy)
  - Most data points become smaller numbers
- Losslessly compress the data stream
  - The typical range of the data is smaller
  - Fewer bits are required to store the common case

Discrete Cosine Transform (DCT)
- Traditional lossy compression
- Converts a function of time to a function of frequency
- Weighted sum of cosine functions
- The original signal can be completely reconstructed from the generated weights
- FFT-style fast algorithm: O(N log N) vs. O(N^2) for direct evaluation
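
The "weighted sum of cosines" and the claim of complete reconstruction can be checked with a small sketch. It uses the direct O(N^2) evaluation of the orthonormal DCT-II rather than a fast algorithm, and the names `dct` and `idct` are illustrative.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a sequence, evaluated directly in O(N^2)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct(): rebuild the signal as a weighted sum of cosines."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out
```

Applying `idct(dct(x))` returns `x` up to floating-point rounding, and a constant signal produces a single non-zero weight (the lowest frequency).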

2D DCT
- Treat each row of the signal as a 1D signal, perform a 1D transform
- Treat each column of the transformed signal as a 1D signal, perform another 1D transform
- Separable transformation: 2nk vs. nk^2
- 3D extension?

Transform coding
- The DCT itself does not perform any compression
- Images concentrate most of their information in low-frequency components
- High-frequency components can be stored with less precision (the human visual system is less sensitive to them)
- Often high-frequency components round to zero, and the loss of information is not noticeable
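
The row-then-column 2D transform and the quantization step can be sketched together. This is an illustration under assumptions, not a codec: the uniform step size passed to `quantize` is hypothetical (real codecs typically use coarser steps for higher frequencies), and the function names are illustrative.

```python
import math

def dct_1d(x):
    """Orthonormal 1D DCT-II, direct evaluation."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(s * (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)))
    return out

def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major

def quantize(coeffs, step):
    """Lossy step: divide by a step size (an assumed value) and round."""
    return [[round(v / step) for v in row] for row in coeffs]
```

For a smooth 8x8 block such as a gentle gradient, almost all quantized coefficients come out as zero, which is what makes the later lossless stage effective.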

Global transform
- DCT acts on an entire signal
  - So perform it on image blocks
  - One value per frequency for an entire block
  - Block artefacts
- Image discontinuities
  - Sharp edges dividing otherwise relatively low-frequency areas
  - High-frequency components localized to a small number of pixels
  - DCT is less effective at representing these compactly

Discrete Wavelet Transform (DWT)
- Decomposition into two signals, each with half the resolution of the input
- Approximation signal
  - Low-resolution version of the original
  - Contains only low frequencies
- Detail signal
  - Information lost by reducing the resolution
  - Contains only high frequencies

Discrete Wavelet Transform (DWT)
- Approximation signal is recursively transformed
  - Image entirely converted to detail signals of various resolutions
- Final result is effectively a sum of scaled and translated versions of a wavelet (a small portion of a wave)
  - Wavelets have location, waves have phase
  - Avoids undershoot and ringing
- 2D DWT is often separable (though this depends on the wavelet)
  - Square decomposition

The Haar Wavelet
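
As a rough illustration of the recursive decomposition with the Haar wavelet, the sketch below uses the unnormalized average/difference form and assumes the signal length is a power of two. The function names are illustrative.

```python
def haar_step(signal):
    """One level: pairwise averages (approximation) and half-differences (detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail

def haar_decompose(signal):
    """Recursively transform the approximation until one value remains."""
    details = []
    approx = list(signal)
    while len(approx) > 1:
        approx, d = haar_step(approx)
        details.append(d)  # finest level first; details are never re-transformed
    return approx[0], details

def haar_reconstruct(mean, details):
    """Undo haar_decompose(), starting from the coarsest level."""
    approx = [mean]
    for d in reversed(details):
        nxt = []
        for a, di in zip(approx, d):
            nxt += [a + di, a - di]
        approx = nxt
    return approx
```

For the signal [9, 7, 3, 5] this yields the overall mean 6 plus detail signals [1, -1] and [2], from which the original is recovered exactly.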

More complicated wavelets

Locality
- Detail signal is not transformed
- Despite being high frequency, discontinuities will remain localized
- Can be less effective for periodic signals, better for images

Motion compensation
- Calculate motion direction of parts of an image
- Temporal coherence: similarity between neighboring video frames
- Global: describes motion of the camera
- Local: describes motion of small objects (within a block of an image)
- Motion compensation => a next-frame prediction
- Residue (difference from prediction) is stored
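
One simple way to estimate local motion is an exhaustive block search that minimizes the sum of absolute differences (SAD). The slides do not say which search the codec uses, so treat the following as an assumed, minimal sketch; frames are lists of rows of integers and all names are illustrative.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block in cur and a block in ref."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )

def best_motion_vector(ref, cur, bx, by, size, radius):
    """Search ref within +/- radius pixels for the best match of a cur block."""
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + size > w or ry + size > h:
                continue  # candidate block would fall outside the frame
            cost = sad(cur, ref, bx, by, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost

def residue(ref, cur, bx, by, size, mv):
    """Difference between the actual block and its motion-compensated prediction."""
    dx, dy = mv
    return [
        [cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i] for i in range(size)]
        for j in range(size)
    ]
```

When the current frame is just the reference shifted by one pixel, the search finds that shift and the residue is all zeros, so only the vector needs to be stored.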

Accelerating Wavelet Based Video Coding on Graphics Hardware using CUDA
Wladimir J. van der Laan, Jos B.T.M. Roerdink, Andrei C. Jalba

Dirac Wavelet Video Codec (DWVC)
- Video compression format
- Open source, royalty-free alternative to H.264; roughly equivalent quality
- BBC Research
- Dirac research: reference implementation
- Schrödinger: high performance
  - Heavily optimized
  - Good basis for performance comparison

DWVC decoding
- Stream data
  - Intra frames: self-contained images
  - Inter frames: difference with respect to one or two reference frames
- Arithmetic decoding
  - Lossless; extracts parameters, vectors and coefficients from the bitstream
  - Reversed entropy coder, which represents common values with shorter bit sequences
  - Little inherent parallelism: handled by the CPU
- Motion compensation
  - Residue (difference from prediction) stored as wavelet coefficients

CUDA implementation
- Use CUDA to avoid mapping the decoding process to the rendering pipeline
- Lifting scheme: less arithmetic, in place
- Frame arithmetic: 16 vs. 32 bit?
- Sub-pixel precision
  - Bicubic interpolation of reference frame

Separable transformation for wavelet lifting
- Decompose the 2D operation into two 1D operations

Horizontal pass
- Coalesced read of part of a row
- Duplicate border elements: boundary conditions
- Shared memory: in-place lifting
- __syncthreads() after each step in the transform
- Coalesced write back to global memory
- Reorganized coefficients, based on JPEG 2000 cache-efficient wavelet lifting

Vertical pass
- Substituting rows for columns => poor coalescing
- Each block processes multiple columns: a slab
- Each row in a slab can be read with coalescing
- Shared memory: transform on columns
- Sliding window: not all columns can fit in shared memory

Motion compensation: block placement
- Traditional
  - Divide image into equally sized, disjoint blocks
  - Strong discontinuities between neighboring blocks
  - Poor prediction on block edges
- Overlapped Block Motion Compensation
  - Overlaps neighboring blocks
  - Blending together in shared area

Reference frame options
- Previous frame
- Previous and next (blended together with some weights) for fades
- A different frame several frames back, if it is a better match

Overlapped blocks
- Each pixel is part of up to four motion compensation blocks per frame
- Naïve implementation
  - Equally sized CUDA blocks
  - Complicated flow control: neighboring pixels access different motion compensation blocks

Solution: divide image into regions
- Based on the number and orientation of overlapping blocks
  - Center: 1 block
  - Edges: 2 blocks (H or V overlap), linear blend
  - Corners: 4 blocks, bilinear blend
- All pixels in a region run the same code
- Each region is processed by one CUDA block
  - No block-divergent branching
- Texture faster than constant memory
  - Each thread potentially accesses a different location
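
The linear and bilinear blends used in the edge and corner regions can be sketched as follows. The ramp shape in `ramp` is an assumption chosen for illustration (the slides do not give the codec's actual weights), and the example works on a single row of pixels for clarity.

```python
def ramp(overlap):
    """Assumed linear weights for the first block across an overlap, in (0, 1)."""
    return [1 - (i + 0.5) / overlap for i in range(overlap)]

def blend_row(left_pred, right_pred, overlap):
    """Blend two block predictions whose last/first `overlap` samples coincide."""
    out = list(left_pred[:-overlap])          # center of left block: 1 block
    w = ramp(overlap)
    start = len(left_pred) - overlap
    for i in range(overlap):                  # edge region: 2 blocks
        out.append(w[i] * left_pred[start + i] + (1 - w[i]) * right_pred[i])
    out += right_pred[overlap:]               # center of right block: 1 block
    return out

def corner_weights(overlap):
    """Bilinear weight of the top-left block in a corner region (4 blocks)."""
    w = ramp(overlap)
    return [[wy * wx for wx in w] for wy in w]
```

Because every pixel of a region uses the same formula, a region maps naturally onto one CUDA block without divergent branches; the blend itself removes the hard step between neighboring predictions.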

Results
- Dual Core AMD Opteron 280 vs. Nvidia GeForce GTX 280, CUDA 2.2
  - Single threaded
- GPU times do not require readback (video is displayed through OpenGL textures)
- 5.4x overall speedup for the entire decode process
- 13x speedup for GPU operations (arithmetic decoding excluded)
- 1920x1080 (1080p) displayed at 56.4 fps
  - 25 fps needed for movie playback
  - 10.5 fps for CPU reference

Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting
Christian Tenllado, Javier Setoain, Manuel Prieto, Luis Piñuel, Francisco Tirado

Focus on DWT
- Has other image processing/computer graphics applications
  - Multiresolution analysis
- Primary methods:
  - Filter bank
  - Lifting scheme

Filter bank
- Given signal A:
  - Run a low-pass filter (convolution) on A to get a low-frequency approximation (~blur)
  - Run the corresponding high-pass filter on A to get the high-frequency details
  - Halve the sampling rate of both (since we now have twice as much information as necessary)
  - Recurse on the approximation
- Direct translation of the definition of the wavelet transform
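
A minimal sketch of one filter bank step, assuming periodic boundary handling and using Haar taps as the example filter pair. The helper names are illustrative, and the taps are applied in correlation order for simplicity.

```python
def filter_periodic(signal, taps):
    """Apply an FIR filter at every sample, wrapping around at the end."""
    n = len(signal)
    return [
        sum(t * signal[(i + k) % n] for k, t in enumerate(taps))
        for i in range(n)
    ]

def filter_bank_step(signal, low, high):
    """Filter with both filters, then keep every second output of each."""
    approx = filter_periodic(signal, low)[0::2]   # low-pass, downsampled
    detail = filter_periodic(signal, high)[0::2]  # high-pass, downsampled
    return approx, detail

# Haar analysis filters in unnormalized average/difference form.
HAAR_LOW = [0.5, 0.5]
HAAR_HIGH = [0.5, -0.5]
```

Note that both filters are evaluated at every sample and half of the results are then discarded; this redundant work is what the lifting scheme avoids.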

Lifting scheme
- Combine high-pass and low-pass filters
- Any filter bank scheme (FBS) wavelet can be factorized into several lifting scheme (LS) steps with the polyphase matrix representation
- Split signal into odd/even values (lazy wavelet transform)
- Predict
- Update

LS advantages
- Simple to invert: run in the opposite direction (no reverse convolution)
- Method for producing wavelet transforms
- Control over the actual operations that are executed
  - Can use integer operations => lossless compression
- Easy to generalize
  - The "+" operation must be invertible, but it doesn't have to be addition
- Tends to be more efficient w.r.t. amount of hardware or power consumption for embedded systems
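
The split/predict/update structure, integer arithmetic and easy inversion can be shown with an integer Haar lifting step (often called the S-transform). This is one simple example wavelet, not necessarily the one used in the papers discussed here; it assumes an even-length list of integers.

```python
def lift_forward(signal):
    """Integer Haar lifting: split, predict, update."""
    even = list(signal[0::2])                     # split (lazy wavelet transform)
    odd = list(signal[1::2])
    detail = [o - e for o, e in zip(odd, even)]   # predict odd from even
    approx = [e + (d >> 1) for e, d in zip(even, detail)]  # update: floor of pair mean
    return approx, detail

def lift_inverse(approx, detail):
    """Run the same steps in the opposite order with opposite signs."""
    even = [s - (d >> 1) for s, d in zip(approx, detail)]  # undo update
    odd = [d + e for d, e in zip(detail, even)]            # undo predict
    out = []
    for e, o in zip(even, odd):                            # merge
        out += [e, o]
    return out
```

Since each step only adds or subtracts a value computed from the other half of the samples, the inverse is exact even with the rounding shift, which is what makes lossless integer transforms possible.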

FBS vs. LS speed
- CPU: LS up to twice the speed of FBS
  - Performs about half as many computations
  - Though actual gains are often smaller than theoretical
  - In-place transform
  - LS is the default way to implement the wavelet transform; seen as most efficient
- GPU: FBS is actually faster
  - Fewer synchronization barriers

Implementation
- OpenGL + Cg
- Layout: 2x2 blocks stored in an RGBA texel
  - Allows H and V algorithms to be designed symmetrically
- Filter bank
  - Synch barrier between H and V filters
- Lifting scheme
  - Several loops to perform simple vector operations on each data stream
  - Every LS step performed by a different kernel
  - Many synch barriers

Results
- Execution times scale linearly with problem size
  - Ratio of LS time to FBS time approaches a constant as size grows
- Speedups from Nvidia FX 5950 Ultra (2003) to 7800 GTX (2005)
  - 4x for FBS
  - 2.2x for LS

Results
- Key performance factor is the number of rendering passes and synch barriers
  - FBS doesn't require a pipeline flush, allows better parallelization
  - LS: removing synch barriers (incorrect output, but good performance estimate): 1.4x speedup
- GPU: 1.2-3.4x speedup over CPU implementation, without data transfer
- Transform a 4M-pixel image in 9.12 and 17.9 ms using FBS and LS respectively, using Daubechies 4
  - Slower times for more complicated wavelets

Future improvements
- LS/FBS time ratio grows as the number of shader processors increases
  - Future GPUs will progressively favor FBS
- Waiting for better CPU/GPU integration
- Suggest fusing consecutive kernels: increased complexity, but faster

Summary
- GPU allows a speedup of several times over CPU for decompression with modern codecs
  - May not seem dramatic, but helps cross the barrier of movie frame rate
- Allows more types of compression algorithms to become feasible
- Implementation methods that are best for CPU may not be best for GPU