A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION

Similar documents
BANDWIDTH-EFFICIENT ENCODER FRAMEWORK FOR H.264/AVC SCALABLE EXTENSION. Yi-Hau Chen, Tzu-Der Chuang, Yu-Jen Chen, and Liang-Gee Chen

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC

Adaptation of Scalable Video Coding to Packet Loss and its Performance Analysis

One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain

IN RECENT years, multimedia application has become more

Optimum Quantization Parameters for Mode Decision in Scalable Extension of H.264/AVC Video Codec

STACK ROBUST FINE GRANULARITY SCALABLE VIDEO CODING

FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION

Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda

Unit-level Optimization for SVC Extractor

Scalable Video Coding

Performance Comparison between DWT-based and DCT-based Encoders

H.264/AVC und MPEG-4 SVC - die nächsten Generationen der Videokompression

On the impact of the GOP size in a temporal H.264/AVC-to-SVC transcoder in Baseline and Main Profile

Scalable Video Coding in H.264/AVC

CONTENT ADAPTIVE COMPLEXITY REDUCTION SCHEME FOR QUALITY/FIDELITY SCALABLE HEVC

Cross-Layer Optimization for Efficient Delivery of Scalable Video over WiMAX Lung-Jen Wang 1, a *, Chiung-Yun Chang 2,b and Jen-Yi Huang 3,c

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation

Fast Mode Decision for H.264/AVC Using Mode Prediction

An Efficient Mode Selection Algorithm for H.264

Real-time and smooth scalable video streaming system with bitstream extractor intellectual property implementation

High Efficient Intra Coding Algorithm for H.265/HVC

ALGORITHM AND ARCHITECTURE DESIGN FOR INTRA PREDICTION IN H.264/AVC HIGH PROFILE

Department of Electrical Engineering, IIT Bombay.

A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation

EE Low Complexity H.264 encoder for mobile applications

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc.

Bit Allocation for Spatial Scalability in H.264/SVC

BANDWIDTH REDUCTION SCHEMES FOR MPEG-2 TO H.264 TRANSCODER DESIGN

An Improved H.26L Coder Using Lagrangian Coder Control. Summary

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding

Fast frame memory access method for H.264/AVC

MCTF and Scalability Extension of H.264/AVC and its Application to Video Transmission, Storage, and Surveillance

Fast Mode Decision Algorithm for Scalable Video Coding Based on Probabilistic Models

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye

SINGLE PASS DEPENDENT BIT ALLOCATION FOR SPATIAL SCALABILITY CODING OF H.264/SVC

Fraunhofer Institute for Telecommunications - Heinrich Hertz Institute (HHI)

ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING

TRADITIONALLY, architectural design focuses more on

EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER

2014 Summer School on MPEG/VCEG Video. Video Coding Concept

Recent, Current and Future Developments in Video Coding

H.264/AVC Baseline Profile to MPEG-4 Visual Simple Profile Transcoding to Reduce the Spatial Resolution

Xin-Fu Wang et al.: Performance Comparison of AVS and H.264/AVC 311 prediction mode and four directional prediction modes are shown in Fig.1. Intra ch

Performance Analysis of DIRAC PRO with H.264 Intra frame coding

Smoooth Streaming over wireless Networks Sreya Chakraborty Final Report EE-5359 under the guidance of Dr. K.R.Rao

Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration

An Efficient Intra Prediction Algorithm for H.264/AVC High Profile

Advanced Video Coding: The new H.264 video compression standard

A 4-way parallel CAVLC design for H.264/AVC 4 Kx2 K 60 fps encoder

EE 5359 MULTIMEDIA PROCESSING SPRING Final Report IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H.

Implementation and analysis of Directional DCT in H.264

Optimal Bitstream Adaptation for Scalable Video Based On Two-Dimensional Rate and Quality Models

VIDEO COMPRESSION STANDARDS

EE 5359 Low Complexity H.264 encoder for mobile applications. Thejaswini Purushotham Student I.D.: Date: February 18,2010

Professor, CSE Department, Nirma University, Ahmedabad, India

Digital Video Processing

Analysis and Architecture Design of Variable Block Size Motion Estimation for H.264/AVC

Introduction to Video Encoding

Evaluating the Impact of Carry-Ripple and Carry-Lookahead Adders in Pel Decimation VLSI Implementations

Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter

High Efficiency Video Coding (HEVC) test model HM vs. HM- 16.6: objective and subjective performance analysis

Reduced 4x4 Block Intra Prediction Modes using Directional Similarity in H.264/AVC

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING

Advances of MPEG Scalable Video Coding Standard

IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC

Title Adaptive Lagrange Multiplier for Low Bit Rates in H.264.

ESTIMATION OF THE UTILITIES OF THE NAL UNITS IN H.264/AVC SCALABLE VIDEO BITSTREAMS. Bin Zhang, Mathias Wien and Jens-Rainer Ohm

H.264 to MPEG-4 Transcoding Using Block Type Information

NEW CAVLC ENCODING ALGORITHM FOR LOSSLESS INTRA CODING IN H.264/AVC. Jin Heo, Seung-Hwan Kim, and Yo-Sung Ho

Homogeneous Transcoding of HEVC for bit rate reduction

A NOVEL SCANNING SCHEME FOR DIRECTIONAL SPATIAL PREDICTION OF AVS INTRA CODING

EE 5359 H.264 to VC 1 Transcoding

A Quantized Transform-Domain Motion Estimation Technique for H.264 Secondary SP-frames

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

Objective: Introduction: To: Dr. K. R. Rao. From: Kaustubh V. Dhonsale (UTA id: ) Date: 04/24/2012

Testing HEVC model HM on objective and subjective way

OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD

Video Watermarking Algorithm for H.264 Scalable Video Coding

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV

MultiFrame Fast Search Motion Estimation and VLSI Architecture

Video Compression MPEG-4. Market s requirements for Video compression standard

Low-complexity transcoding algorithm from H.264/AVC to SVC using data mining

High Efficiency Video Coding. Li Li 2016/10/18

Complexity Reduced Mode Selection of H.264/AVC Intra Coding

Hardware Architecture Design of Video Compression for Multimedia Communication Systems

Fast Mode Assignment for Quality Scalable Extension of the High Efficiency Video Coding (HEVC) Standard: A Bayesian Approach

FPGA based High Performance CAVLC Implementation for H.264 Video Coding

Reducing/eliminating visual artifacts in HEVC by the deblocking filter.

For layered video encoding, video sequence is encoded into a base layer bitstream and one (or more) enhancement layer bit-stream(s).

Video Coding Standards. Yao Wang Polytechnic University, Brooklyn, NY11201 http: //eeweb.poly.edu/~yao

Adaptive Up-Sampling Method Using DCT for Spatial Scalability of Scalable Video Coding IlHong Shin and Hyun Wook Park, Senior Member, IEEE

An Efficient Hardware Architecture for H.264 Transform and Quantization Algorithms

Transcoding from H.264/AVC to High Efficiency Video Coding (HEVC)

Fast Transcoding From H.264/AVC To High Efficiency Video Coding

FAST MOTION ESTIMATION DISCARDING LOW-IMPACT FRACTIONAL BLOCKS. Saverio G. Blasi, Ivan Zupancic and Ebroul Izquierdo

WITH the growth of the transmission of multimedia content

CAMED: Complexity Adaptive Motion Estimation & Mode Decision for H.264 Video

Efficient MPEG-2 to H.264/AVC Intra Transcoding in Transform-domain

ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS

Transcription:

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION Yi-Hau Chen, Tzu-Der Chuang, Chuan-Yung Tsai, Yu-Jen Chen, and Liang-Gee Chen DSP/IC Design Lab., Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan Email: {ttchen, peterchuang, cytsai, yjchen, lgchen}@video.ee.ntu.edu.tw TRACT In this paper, we propose a cost-efficient residual prediction hardware architecture to support inter-layer prediction in the state-of-art H.264/AVC scalable extension. Several residual prediction schemes are analyzed in hardware architecture and coding performances, and an integer motion estimation ()-simplified scheme is adopted. Then, the linearity of Hadamard transform is introduced to achieve data sharing for residual prediction in fractional motion estimation () architecture. An Hadamard-free residual prediction architecture is proposed with 40% cost saving compared to direct implementation of duplicating module. The proposed architecture is implemented with 86K gates at 220 MHz by UMC 90nm technology for encoding HDTV720p fps. The proposed design concept can be also applied in other VLSI designs and software acceleration for supporting residual prediction. 1. INTRODUCTION In former video coding standards, such as MPEG-2 and H.264, coding efficiency is always the main target. However, due to the prevalence of streaming multimedia applications over wired and wireless networks, more and more researchers pay attention to the functionality and scalability on video. Since late 2002, the Moving Picture Experts Group (MPEG) began to call for proposals of scalable video coding (SVC). These SVC proposals are merged into a Joint Scalable Video Model (JSVM) as the scalable extension of H.264/AVC [1][2]. In current JSVM, three main types of scalability, temporal scalability, spatial scalability, and SNR scalability are provided. Currently, SVC adopts multi-layer coding structure to provide various video scalabilities. Figure 1 shows the SVC encoder architecture by using a multi-scale pyramid with 2 levels of spatial scalability[1][2]. Spatial scalability is achieved by using different spatial resolution layers. To improve the coding performance of SVC, it is necessary to remove all the redundant information among the multi-layer coding structure. In H.264, the inter- and intra-prediction for single-layer Video Spatial decimation Layer 1 Motion-compensated and intra prediction Layer 0 Motion-compensated and intra prediction Inter-layer prediction : - intra - motion - residual SNR enhancement layer coding Base-layer entropy coding SNR enhancement layer coding Base-layer entropy coding Scalable bitstream Multiplex Fig. 1. SVC encoder structure with two spatial layers. coding scheme have been maturely developed. To remove the rest redundancy between spatial layers, Schwarz et al. proposed an inter-layer prediction scheme in [3]. The main concept is to remove the redundancy of motion and residual between base layer (BL, smaller frame resolution) and enhancement layer (EL, higher frame resolution) as shown in Fig. 1. The coding performance can be improved about 1 db compared to that without any inter-layer prediction tools as shown in Fig. 2. Although the inter-layer prediction can improve the coding performance, it nearly doubles the computation complexity since the current block s pixels are subtracted by upsampled residual block in residual prediction and it may lead to different prediction results. In following context, the prediction without base-layer residual is represented as normal prediction to distinguish from residual prediction. If the prediction engine which contains both normal prediction and residual prediction are directly implemented from previous H.264 designs [4, 5], the operating frequency will be doubled or the cost of processing elements will be largely increased. In this paper, we propose an integer motion estimation ()-simplified scheme and introduce linearity of

Hadamard transform into fractional motion estimation () architecture. An Hadamard-free residual prediction architecture which can parallel process normal prediction and residual prediction is proposed. Compared to direct implementation, the proposed architecture can save 80% hardware cost and maintain the same processing capability according to referenced designs. 2. INTER-LAYER RESIDUAL PREDICTION 2.1. Residual Prediction in SVC 28 27 Harbour 4CIF over 65 frames, Qp= for QCIF/CIF JSVM 8.11 As described in [1, 3], inter-layer prediction can be regarded as three aspects, motion vector prediction, intra block prediction and residual prediction. Inter-layer residual prediction uses the upsampled BL residual information to reduce the information of EL residual. To provide best coding performance, the above inter-layer prediction tools are arbitrarily selected after comparing to the rate-distortion results of normal prediction. While inter-layer intra prediction only increases one search candidate in intra prediction, the residual prediction in JSVM performs during whole and stages to find a best matched candidate as shown in Fig. 3. It greatly increases the computation complexity of SVC encoder since the computation of and are nearly doubled. Therefore, in this paper, we focus on the cost-efficient VLSI architecture of inter-layer residual prediction. 2.2. Design Challenge of Residual Prediction In many previous H.264/AVC hardware designs[6, 7, 8], the hardware designs of and are well-investigated. However, none of these designs consider residual prediction. To achieve both normal and residual prediction as shown in Fig. 3 by above designs, the computation cycles will be doubled since they can not parallel processing normal prediction and normal prediction. It nearly doubles the required operating frequency of reference designs while supporting inter-layer prediction in SVC. Because the processing cycles for each MB becomes more and more critical in encoding HDTV resolution and parallel processing may achieve higher data reuse to save power, parallel processing for both normal and residual prediction schemes is a better design strategy. However, to maintain the same processing capability, the processing elements (PE) of above designs should be doubled and it leads to a large hardware cost overhead from residual prediction. For example, the hardware cost of sum-of-absolute-difference (SAD) tree in [6, 7] will be doubled, and the number of 4 4 processing unit (PU) and interpolation unit adopted in [4, 8] will be doubled, too. How to efficiently reduce the hardware overhead of residual prediction becomes an necessity for future SVC designs. 26 10 1800 20 2800 30 37 36 (a)harbour Soccer 4CIF over 65 frames, Qp= for QCIF/CIF JSVM 8.11 700 1200 1700 2200 36 28 27 (b)soccer City 4CIF over 65 frames, Qp= for QCIF/CIF JSVM 8.11 500 1000 1500 2000 (c)city Fig. 2. PSNR comparison of inter-layer prediction(ilp) for the luminance part of (a) Harbour (b) Soccer (c) City 4CIF fps over 65 frames with GOP=16 by JSVM 8.11 hierarchical B-frame coding; ILP : inter-layer prediction; case 1 is to check residual prediction only on normal prediction s results

Upsampled residual block Current block - Current block* Normal Prediction Different search candidate block Residual Prediction MB Decision Fig. 3. Computation procedure of inter-layer residual prediction in JSVM encoder. Upsampled residual block Current block - Current block* Normal Prediction Integer MVs Residual Prediction Same search candidate block MB Decision Fig. 4. Computation procedure of Proposed -simplified inter-layer residual prediction. 2.3. Simplified Residual Prediction Scheme Figure 2 demonstrates the coding performance comparison of inter-layer prediction under different residual prediction constraints for sequence Harbour, Soccer and City at 4CIF fps, GOP = 16 by 4-level hierarchical B-frame decomposition. FGS is not involved into our analysis and QPs for QCIF/CIF are set to. When all inter-layer prediction tools are enabled, it has best coding performance but the hardware cost are doubled compared to normal prediction only. Case 1 represents that the residual prediction is only performed on the normal prediction results. Although only increase few hardware processing cycles on conventional H.264 designs, about 0.4 to 1 db degradation is observed from simulation results in Fig. 2. Based on the inference in [3] that residual prediction is more likely to improve coding performance while the motion vector (MV) for the block of current layer are similar to the MV of the corresponding base layer block, we assume that the normal prediction of can effectively find such MV MV motion trend and residual prediction is more efficient during fractional motion refinement. Thus we propose an simplified residual prediction scheme. The of residual prediction in Fig. 3 is removed, and the integer MV from normal prediction are input to for residual prediction refinement. By the adopted -simplified scheme, the coding performance can be the same as all ILP tools enable at low and medium bitrate and at most 0.2 db degradation at higher bitrate. Fig. 4 demonstrates our proposed residual prediction scheme. Since the same integer MVs are refined for both predictions, the loaded search candidates data are the same for both predictions module. Thus, we will focus on integrating residual prediction into architecture in following sections. 3. LINEARITY OF HADAMARD TRANSFORM In most algorithm and hardware design, sum-of-absolutetransformed-difference () by Hadamard transform is used as cost evaluation value for rate-distortion optimization to provide better coding performance, and this Hadamard transform module possesses about 40 to 50% cost [4, 8]. Since Hadamard transform is a linear transformation, it has following linear properties, H(A + B) = HA + HB, (1) where H represents the 2-D Hadamard transform, and both A and B are square matrixes. Based on above linearity, it is possible to implement data reuse between normal prediction and residual prediction. In general, the of each 4 4 block of normal prediction and residual prediction () can be modeled as (2) and (3), respectively. SAT D n,4 4 = H(D n,4 4 ) = H(cur 4 4 ref n,4 4 ), (2) SAT D n,4 4, = H(D n,4 4, ) = H(cur 4 4 res 4 4 ref n,4 4 ), where represents the data for residual prediction, n is the index of current search candidate, cur 4 4, ref n,4 4, res 4 4 are the current block, interpolated reference block and upsampled residual block, respectively. Based on the linearity of Hadamard transform, it is possible to substitute the term of (3) by (2) as follows, H(D n,4 4, ) = H(cur 4 4 res 4 4 ref n,4 4 ) = H(cur 4 4 ref n,4 4 ) H(res 4 4 ) (3) = H(D n,4 4 ) H(res 4 4 ). (4) Thus, the Hadamard transformed coefficients in (2) can now be reused to calculate cost of residual prediction without performing the Hadamard transform in (3) again. Since each

Search Window Data Intepolation Current MB Data Upsampled Residual MB Data 4x4 Residual Hadamard PE 4x1 n,4x1 n,4x1, PE1 PE 2 PE 9 PE 1 PE 2 PE 9 (b) PE x4 MV Cost Generator Mode Cost Generator Comparator Best MV Desicion Best Mode Desicion (a) Comparator 1-D Hadamard 1-D Hadamard (c) Transformed coefficients x4 Fig. 5. (a)block diagram of proposed residual prediction hardware (b) architecture of residual prediction () PE (c) architecture of 4 4 PE. in above figures means absolute value. search candidate applies the same upsampled residual block for residual prediction, the transformed coefficients of 4 4 residual block, H(res 4 4 ), can be shared by every search candidate simultaneously. 4. PROPOSED RESIDUAL PREDICTION VLSI ARCHITECTURE In this paper, we take the architecture in [4] as design example. Please note that the concept of proposed Hadamardfree architecture can be also applied in other existing design [8]. The proposed architecture with residual prediction is shown in Fig. 5. The region surrounded by dashed line is specified for residual prediction. The blue arrows and red arrows represent the data flow of transformed coefficients of differences and upsampled residual data, respectively. Originally, the SAT D of each search candidate requires one 4 4 PE as shown in Fig. 5(c). Based on (4), the Hadamard-transformed coefficients of difference are reused to derive SAT D, and the transformation for residual prediction () can be eliminated. Thus, in our proposed PE in Fig. 5(b), the two 1-D Hadamard modules and 4 4 transposed registers are removed, and it can largely reduce the hardware cost compared to direct implementation. Since the transformed coefficients of residual block are required in (4), one 4 4 Hadamard transform PE in Fig. 5(c) is added. Besides, since the integer motion vectors for normal prediction and residual prediction are the same, the output of interpolation unit can be shared. It also save cost of one additional interpolation unit. Table 1. Implementation results of proposed architecture with residual prediction function based on [4] TSMC 0.18um @100MHz UMC 90nm @220MHz [4] Proposed [4] Proposed Interpolation Unit 25501 25501 25593 25593 MVCost Core 50 50 5438 5424 4 4 PE 9 863 865 391 398 PE 9 0 8484 0 8891 Res. Transform PE 0 76 0 2870 Others 5465 9399 5670 9663 Total 70860 865 70092 85839 unit : NAND2 gate count 5. IMPLEMENTATION RESULTS Based on our previous design [4] for H.264/AVC baseline profile, an architecture of with Hadamard-free residual prediction has been proposed and synthesized by UMC 90nm technology and TSMC 0.18um technology, respectively. Since our proposed architecture can parallel process both normal prediction and residual prediction, the processing capability can be the same as the referenced design. According to [4], each macroblock requires about 2000 cycles. Our proposed architecture can support 4CIF fps in 100MHz and 720p fps in 220MHz. The gate count distributions of two technologies are listed in Table 1. Based on Table 1, the proposed architecture can achieve both normal prediction and

residual prediction with only (85839 70092)/70092 = 22% hardware cost compared to original [4]. Please note that [4] can not process normal prediction and residual prediction simultaneously. Comparing to directly duplicating [4] for parallel processing normal and residual prediction, the proposed architecture can save the cost of 4 4 PE for residual prediction about 21000 gate counts and reuse the interpolation unit. Based on synthesized results in Table 1, (70092 2 85839)/(70092 2) = 39% hardware cost can be saved. [8] Y.-J. Wang, C.-C. Cheng, and T.-S. Chang, A fast algorithm and its VLSI architecture for fractional motion estimation for H.264/MPEG-4 AVC video coding, IEEE Trans. on CSVT, vol. 17, no. 5, pp. 578 583, May 2007. 6. CONCLUSION In this paper, we propose a cost efficient residual prediction architecture for H.264/AVC scalable extension. The simplified scheme is adopted with less than 0.2 db quality loss, and the linearity of Hadamard transform is applied to reduce the hardware cost while integrating residual prediction. Compared to directly duplicated implementation, it can reduce about 40% hardware cost. Moreover, the proposed design concept can be applied in other designs and software acceleration to support inter-layer prediction. 7. REFERENCES [1] ISO/IEC JTC1, Joint Draft 8 of SVC Amendment, ISO/IEC JTC1/SC/WG11 and ITU-T SG16 Q.6, Doc. JVT-U201, Oct. 2006. [2] ISO/IEC JTC1, Joint Scalable Video Model 8.0, ISO/IEC JTC1/SC/WG11 and ITU-T SG16 Q.6, Doc. JVT-U202, Oct. 2006. [3] H. Schwarz, D. Marpe, and T. Wiegand, SVC core experiment 2.1: Inter-layer prediction of motion and residual data, ISO/IEC JTC1/SC/WG11/M11043, July 2004. [4] T.-C Chen, Y.-W. Huang, and L.-G. Chen, Fully utilized and reusable architecture for fractional motion estimation of H.264/AVC, in Proceedings of IEEE ICASSP, May 2004, pp. V 9 V 12. [5] Y.-W. Huang and et al., A 1.3tops H.264/AVC singlechip encoder for HDTV applications, in Proc. of IEEE ISSCC, 2005, pp. 128 588. [6] T.-C. Chen and et al., Analysis and architecture design of an HDTV720p frames/s H.264/AVC encoder, IEEE Trans. on CSVT, vol. 16, no. 6, pp. 673 688, June 2006. [7] Z. Liu, Y. Song, T. Ikenaga, and S. Goto, A pipeline parallel tree architecture for full search variable block size motion estimation in H.264/AVC, in Proc. of Picture Coding Symposium, 2006.