Area Efficient SAD Architecture for Block Based Video Compression Standards

Similar documents
MultiFrame Fast Search Motion Estimation and VLSI Architecture

POWER CONSUMPTION AND MEMORY AWARE VLSI ARCHITECTURE FOR MOTION ESTIMATION

Enhanced Hexagon with Early Termination Algorithm for Motion estimation

A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation

FPGA IMPLEMENTATION OF SUM OF ABSOLUTE DIFFERENCE (SAD) FOR VIDEO APPLICATIONS

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

Implementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications

Express Letters. A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation. Jianhua Lu and Ming L. Liou

High Performance Hardware Architectures for A Hexagon-Based Motion Estimation Algorithm

An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder

A Quantized Transform-Domain Motion Estimation Technique for H.264 Secondary SP-frames

Sum to Modified Booth Recoding Techniques For Efficient Design of the Fused Add-Multiply Operator

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC

High Performance and Area Efficient DSP Architecture using Dadda Multiplier

A VIDEO TRANSCODING USING SPATIAL RESOLUTION FILTER INTRA FRAME METHOD IN MULTIMEDIA NETWORKS

FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION

Prachi Sharma 1, Rama Laxmi 2, Arun Kumar Mishra 3 1 Student, 2,3 Assistant Professor, EC Department, Bhabha College of Engineering

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER

Analysis of Different Multiplication Algorithms & FPGA Implementation

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR.

IN RECENT years, multimedia application has become more

Multiresolution motion compensation coding for video compression

32-bit Signed and Unsigned Advanced Modified Booth Multiplication using Radix-4 Encoding Algorithm

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

Fobe Algorithm for Video Processing Applications

[Kalyani*, 4.(9): September, 2015] ISSN: (I2OR), Publication Impact Factor: 3.785

A High Sensitive and Fast Motion Estimation for One Bit Transformation Using SSD

Paper ID # IC In the last decade many research have been carried

Power Efficient Sum of Absolute Difference Algorithms for video Compression

Hybrid Video Compression Using Selective Keyframe Identification and Patch-Based Super-Resolution

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter

An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000

MODULO 2 n + 1 MAC UNIT

Fast Block-Matching Motion Estimation Using Modified Diamond Search Algorithm

Design and Implementation of Signed, Rounded and Truncated Multipliers using Modified Booth Algorithm for Dsp Systems.

A Dedicated Hardware Solution for the HEVC Interpolation Unit

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression

II. MOTIVATION AND IMPLEMENTATION

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER.

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017

Design and Implementation of Low-Complexity Redundant Multiplier Architecture for Finite Field

By Charvi Dhoot*, Vincent J. Mooney &,

Efficient Method for Half-Pixel Block Motion Estimation Using Block Differentials

A Novel Design of 32 Bit Unsigned Multiplier Using Modified CSLA

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding

Reducing Computational Time using Radix-4 in 2 s Complement Rectangular Multipliers

Fast Wavelet-based Macro-block Selection Algorithm for H.264 Video Codec

FPGA Implementation of a High Speed Multiplier Employing Carry Lookahead Adders in Reduction Phase

Survey Paper on HEVC Standard on Moving Object Detection in Compressed Domain to Analyze the Input Frame Verses Output Frame

ASIC IMPLEMENTATION OF 16 BIT CARRY SELECT ADDER

16 BIT IMPLEMENTATION OF ASYNCHRONOUS TWOS COMPLEMENT ARRAY MULTIPLIER USING MODIFIED BAUGH-WOOLEY ALGORITHM AND ARCHITECTURE.

Joint Adaptive Block Matching Search (JABMS) Algorithm

A New Fast Motion Estimation Algorithm. - Literature Survey. Instructor: Brian L. Evans. Authors: Yue Chen, Yu Wang, Ying Lu.

A High Quality/Low Computational Cost Technique for Block Matching Motion Estimation

A Modified Hardware-Efficient H.264/AVC Motion Estimation Using Adaptive Computation Aware Algorithm

COARSE GRAINED RECONFIGURABLE ARCHITECTURES FOR MOTION ESTIMATION IN H.264/AVC

Title Adaptive Lagrange Multiplier for Low Bit Rates in H.264.

Detecting and Correcting the Multiple Errors in Video Coding System

Toward Optimal Pixel Decimation Patterns for Block Matching in Motion Estimation

AN ADJUSTABLE BLOCK MOTION ESTIMATION ALGORITHM BY MULTIPATH SEARCH

Reducing/eliminating visual artifacts in HEVC by the deblocking filter.

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING

VHDL IMPLEMENTATION OF FLOATING POINT MULTIPLIER USING VEDIC MATHEMATICS

An Efficient Mode Selection Algorithm for H.264

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION

Complexity Reduced Mode Selection of H.264/AVC Intra Coding

ISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies

Efficient Comparator based Sum of Absolute Differences Architecture for Digital Image Processing Applications

Power Optimized Programmable Truncated Multiplier and Accumulator Using Reversible Adder

Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration

Design of 2-D DWT VLSI Architecture for Image Processing

DESIGN OF PARAMETER EXTRACTOR IN LOW POWER PRECOMPUTATION BASED CONTENT ADDRESSABLE MEMORY

Delay Optimised 16 Bit Twin Precision Baugh Wooley Multiplier

Star Diamond-Diamond Search Block Matching Motion Estimation Algorithm for H.264/AVC Video Codec

Built In Self Test For Multi Error Detection In Motion Estimation Computing Arrays

Reduced Frame Quantization in Video Coding

Design & Analysis of 16 bit RISC Processor Using low Power Pipelining

Area Efficient, Low Power Array Multiplier for Signed and Unsigned Number. Chapter 3

Fast Motion Estimation for Shape Coding in MPEG-4

Detecting and Correcting the Multiple Errors in Video Coding System

Design of a Multiplier Architecture Based on LUT and VHBCSE Algorithm For FIR Filter

A Sum Square Error based Successive Elimination Algorithm for Block Motion Estimation

JOURNAL OF INTERNATIONAL ACADEMIC RESEARCH FOR MULTIDISCIPLINARY Impact Factor 1.393, ISSN: , Volume 2, Issue 7, August 2014

Rate Distortion Optimization in Video Compression

An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary Common Sub-Expression Elimination Algorithm

An HEVC Fractional Interpolation Hardware Using Memory Based Constant Multiplication

FAST MOTION ESTIMATION DISCARDING LOW-IMPACT FRACTIONAL BLOCKS. Saverio G. Blasi, Ivan Zupancic and Ebroul Izquierdo

An Adaptive Video Compression Technique for Resource Constraint Systems

Implementation of CMOS Adder for Area & Energy Efficient Arithmetic Applications

A Novel Hexagonal Search Algorithm for Fast Block Matching Motion Estimation

IMPLEMENTATION OF LOW-COMPLEXITY REDUNDANT MULTIPLIER ARCHITECTURE FOR FINITE FIELD

DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARITHMETIC APPLICATIONS

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

Implimentation of A 16-bit RISC Processor for Convolution Application

Research Article Block-Matching Translational and Rotational Motion Compensated Prediction Using Interpolated Reference Frame

International Journal of Engineering and Techniques - Volume 4 Issue 2, April-2018

Design of Convolution Encoder and Reconfigurable Viterbi Decoder

the main limitations of the work is that wiring increases with 1. INTRODUCTION

Transcription:

IJCAES ISSN: 2231-4946 Volume III, Special Issue, August 2013 International Journal of Computer Applications in Engineering Sciences Special Issue on National Conference on Information and Communication (NCIC'13) www.caesjournals.org Area Efficient SAD Architecture for Block Based Video Compression Standards S. Archana 1, D. Rukmani Devi 2 1 M.E VLSI DESIGN, RMK Engineering College Chennai, India. 1 archansundar@gmail.com Abstract Block based motion estimation is one of the critical task in video compression standards such as MPEG-4, H.263, H.264. The key element of the block based motion estimation algorithms is the matching criteria. SAD(Sum of Absolute Difference) is the most common matching criteria chosen in video coding because of its low complexity, good performance and ease of hardware implementation. By utilizing the concept of reversibility a novel compression array unit using reversible 4:2 compressor has been proposed. Experimental results indicate that the proposed compression unit will impose effectiveness on the SAD architecture in calculating the motion vectors with reasonable area overhead and power penalty. Keywords Motion estimation, Sum of absolute difference, 4:2 Compressor, SAD architecture, Reversible gates. I. INTRODUCTION Motion estimation (ME) plays an important role in video coding and processing systems since motion vectors are critical information for temporal redundancy reduction. It has been widely employed in the MPEG-4; H.263 and H.264/MPEG-4 AVC block based video compression standards. Motion estimation is defined as searching the best motion vector which is the displacement of the coordinate of the best similar block in previous frame for the block in current frame. Block-based matching method is the most widely used motion estimation method for video compression since pictures are normally rectangular in shape and block-division can be easily done. Usually, standards bodies, e.g. MPEG, define the standard block sizes for motion estimation. This can be 16 by 16, 8 by 8, etc, depending on the target application of the video codec. The heavy computational cost of the block matching algorithms (BMA) can be a significant problem in real time coding applications. It obtains the best match by minimizing a cost function. Various matching criteria have been proposed and analyzed in the literatures, varying in complexity and efficiency. In this section, the mean absolute difference (MAD), mean square error (MSE), sum of absolute difference (SAD) and sum of absolute transformed difference (SATD) block matching criteria are explained[1]. Though MAD is simple and easy for implementation in hardware, unfortunately MAD tends to overemphasize small differences, giving an inferior result to MSE. MSE is accurate but its complexity is high for both software and hardware implementations. SATD, although it offers significant improvement in prediction quality, transform hardware must be added within the motion estimation loop and hardware complexity is increased. On the other hand, the latency and thus the performance of motion estimation will be degraded due to the added transform hardware. This technique is applied to H.264/AVC when performing quarter pixel accuracy motion estimation. The high accuracy is needed since it is the final stage of motion estimation. SAD is the most common matching criteria chosen in video coding because of its low complexity, good performance and ease of hardware implementation. The only difference between SAD and MAD is that SAD takes the sum of all pixels while MAD measures the average pixel value. Since the block size is constant during 1

subtraction, the average value per pixel is not important. A divide operation is saved and the overall computation is simplified. II. SAD ARCHITECTURE SAD is an extremely fast metric due to its simplicity; it is effectively the simplest possible metric that takes into account every pixel in a block. Therefore it is very effective for a wide motion search of many different blocks. SAD is also easily parallelizable since it analyzes each pixel separately, making it easily implementable. Consider the block size of a motion picture be M N, the location of the current block is (x, y) and the corresponding motion vector (MV) is (i, j). Then the SAD for the current block is defined as SAD x, y, i, j = M 1 N 1 u=0 v=0 A (x+u,y+v) B (x+i+u,y+j +v) The SAD architecture can be divided into three stages: absolute differences calculation, absolute differences accumulation, and minimum SAD determination. Unlike many hardware implementations based on complement adders, Instead of using the area-consuming adders, the LiYufei s [6] SAD architecture is implemented based on the comparison of unsigned numbers. Comparing to the reference architectures, the Li Yufei s architecture reduces the area significantly at a cost of negligible latency. LiYufei has implemented the compression array unit using 4:2 compressor made of pass transistor multiplexer. In this paper an attempt to modify the compression array unit using reversible gates has been made. The whole SAD architecture is shown in figure 1. Fig. 1 SAD Architecture A. Absolute Difference Calculation Unit The Absolute difference unit calculates the absolute differences between each pixel in the current block and the corresponding pixel in the block being compared. This unit implemented using comparators. In this paper the one and two bit comparators are referred as comp-1 and comp-2. Figure 2 is the one bit comparator comp-1 and two bit comparator comp-2 is shown in figure 3. Figure 4 is the whole absolute difference unit. The implementation using comparators has a reasonable area compared to the architectures that uses adders. Fig.2 COMP-1 The X and Y in [6] are implemented by X XOR GR and Y XOR LS signals respectively. The LS, GR and NTE represent the less, greater and not equal signals. In the compression array unit, the compression array is used to calculate the 16 groups of partial results composed of {a7, a6... a0}, {b7, b6 b0} and NTE. 2

Fig. 3 COMP-2 B. B. Proposed Compression Array Unit Fig. 4 Absolute Difference Calculation Unit All of the information encoded in the input must be preserved at the output in computational tasks such as digital signal processing, communication, computer graphics and cryptography. The loss of information is associated with laws of physics requiring that one bit of information lost dissipates k T ln 2 of energy, where k is Boltzmann s constant and T is the temperature of the system. Interest in reversible computation arises from the desire to reduce heat dissipation, thereby allowing higher densities and higher speed. Hence there are compelling reasons to consider circuits composed of reversible gates and the synthesis of such networks. Reversible circuits are circuits (gates) that have one to-one mapping between vectors of inputs and outputs; thus the vector of input states can be always reconstructed from the vector of output states. In this paper, a novel compression array unit has been designed using reversible 4:2 compressor from the DKGP gates. It has been proved that the SAD architecture proposed is reasonable than its counterpart existing in literature, in terms of power and area. This is the first attempt to design a compression array unit using a reversible 4:2 compressor as far as existing literature and our knowledge is concerned. 3

Fig. 5 Reversible 4:2 Compressor designed from DKGP gates The 4:2 compressor implemented using DKGP gates in [4] has been utilized to design the compression array unit. The four input bits which are given as input to the 4:2 compressor are compressed to two bits. The delay of 4:2 compressor is 3 XOR gates in series. The reversible 4:2 compressor designed from two DKGP gates and its block diagram are shown in Figure 5 and 6 respectively. Fig. 6 Block diagram of Reversible 4:2 Compressor The input of the compression array unit is composed of 16 groups of {a7,a6,a5,,0}, {b7,b6,b5,,0} and 3 recursive results SS, SC and SCO. Totally there are 35 groups of input. An AND operation is performed between an int signal and the partial results to initialize the recursive values. The whole compression array unit is shown in figure 8. The sixteen NTEs are compressed separately and the results are inserted into the compression tree. The NTE compression and initialization are shown in figure 7. In the compression tree the partial results are segregated as MSB and LSB and inserted into the tree correspondingly. The result of the compression array unit is carried over to the minimum SAD determination unit. Fig. 7 AND array initialization 4

For simplicity the sum, carry and carry out are only shown in the figure, the other outputs are not shown in the figure 7 and 8. Fig. 7 NTE compression Fig. 8 Reversible Compression array unit C. Minimum Sad Determination Unit The minimum SAD determination unit is composed of a 16-bit carry look ahead adder, 16 bit comparator and a multiplexer. The 16 bit comparator is composed of one and two bit less compressor. In this paper one and two bit less compressor are represented as cmpsr-1 and cmpsr-2. It is shown is figure 9 and 10 respectively. 5

Fig. 9 Cmpsr-1 The cmpsr-2 is composed of half compressor (HC) and an AND gate. The comparator operates in parallel. The 16 bit comparator and the whole minimum SAD unit are shown in figure 11 and 12 respectively. Fig. 10 Cmpsr-2 The area cost is reduced significantly due to the area optimization. LS signal denotes that the candidate SAD is less than the current SAD. If LS equals 1 the current SAD is replaced by the candidate. Fig. 11 16 bit comparator 6

III. Fig.12 Minimum SAD determination unit RESULTS AND DISCUSSION The proposed compression array unit using reversible 4:2 compressor is simulated using Xilinx ISE simulator and the area and power are calculated using Design Vision. The RTL schematic view of the proposed compression array unit is shown is figure 15. The output of the absolute difference unit and the recursive results are given as input to the compression array unit and area and power are calculated. The design vision reported the area and power as 7398 units and 5.1mW. The Area and power reports of the proposed compression array unit are shown in the figure 13 and 14. Fig. 13 Area report 7

Fig. 14 Power report Fig. 15 RTL Schematic of proposed Compression Array unit IV. CONCLUSION This paper presents the SAD architecture used to in estimation of motion vectors in the block based video compression standards. By utilizing the concept of reversibility a novel compression array unit using reversible 4:2 compressor has been proposed. Experimental results indicate that the proposed compression unit will impose effectiveness on the SAD architecture in calculating the motion vectors with reasonable area overhead and power penalty. 8

REFERENCES [1] Chen H and Shu Q, Apparatus for implementing a block matching algorithm for motion estimation in video image processing, U.S. Patent 5 864 372, Jan. 26, 1999. [2] Jarno Vanne, Ewro Aho, Timo D. Hämäläinen and Kimmo kuusilinna, A High-Performance Sum of Absolute Difference Implementation for Motion Estimation, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 16, no. 7, July, 2006. [3] Jheng Y.S, Chen L.G and Chiueh T.D, An efficient and simple VLSI tree architecture for motion estimation algorithms, IEE Trans. On Signal Processing, Vol. 42, no. 2, Feb. 1993. [4] D. Krishnaveni, M. Geetha Priya and K. Baskaran, Design of an Efficient Reversible 8x8 Wallace Tree Multiplier, World Applied Sciences Journal 20(8):1159-1165, ISSN 1818-4952, 2012. [5] Ng Ka Ho (2008), High Efficient Motion Estimation Algorithms in Video Coding, City University of Hong Kong, March, 2008. [6] LiYufei, FengXiubo and WangQin (2007), A High Performance Low Cost SAD Architecture for Video Coding, IEEE Trans. On Consumer Electronics, Vol. 53, no. 2, MAY, 2007. [7] Thomas Wiegand Sullivan G.J, Gisle Bjøntegaard, and Ajay Luthra, Overview of the H.264/AVC Video Coding Standard, IEEE Trans. On Circuits and Systems for video technology, vo. 13, no.7, July, 2003. 9