SAD implementation and optimization for H.264/AVC encoder on TMS320C64 DSP

SETIT 2007
4th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications
March 25-29, 2007, TUNISIA

M. A. BEN AYED, A. SAMET, N. MASMOUDI
Electronics and Information Technology Laboratory, University of Sfax, National School of Engineering. BP W 3038 Sfax, TUNISIA
Mohamedali.benayed@isecs.rnu.tn uri.masmoudi@enis.rnu.tn

Abstract: Motion estimation in video coding standards such as H.264/AVC is considered to be the most time-consuming encoding module. Motion estimation is generally performed on a 16x16 block, although H.264/AVC allows 7 different block sizes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4). The aim of this paper is to optimise the implementation of the motion estimation algorithm on the Texas Instruments TMS320C64 DSP. Specifically, the goal is to use the C64 instruction set to optimise the Sum of Absolute Differences (SAD) engine within the motion estimation, and to take advantage of Direct Memory Access (DMA) to reduce the cycle cost of loading data from external to internal memory. Standard Assembly (SA) is used to implement the different SAD functions in order to exploit the C64 internal architecture and resources efficiently. Experimental results show more than 75% improvement in cycle cost compared to C code for most functions.

Key words: H.264/AVC, Motion estimation, SAD, TMS320C64, SA.

INTRODUCTION

The recently standardized H.264/MPEG-4 AVC video coder [THO 02] (formerly known as ITU-T H.26L) is the result of the work carried out by the Joint Video Team (JVT), formed by the Video Coding Experts Group (VCEG) of the International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO/IEC).
This standard is expected to play an important role in the broadcasting market, since it provides advances in digital video implementations in terms of bit rate reduction, transmission resiliency and video quality. However, the impressive coding efficiency of H.264/AVC comes at the expense of significantly increased algorithmic complexity compared to existing standards, which has limited the availability of cost-effective, high-performance solutions [VAN 04]. In fact, most existing real-time H.264/AVC encoders are implemented on a DSP platform because of its software flexibility for upgrades, relatively low software development cost, and reduced time to market. We implement our SAD engine on the C64 DSP from TI, since its architecture is well suited to multimedia applications. According to recent work [HOR 03], the SAD engine consumes most of the encoding time, both in inter-coding (motion estimation) and in the intra-coding module. H.264/AVC allows 7 different block sizes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4), which results in 16 different functions to be implemented in standard assembly (SA). Each function has its own characteristics and properties in terms of data transfer and computational dependency. We compare our results to the C code.

The paper is structured as follows. The next section describes the internal architecture of our platform, the TI C64. The section after it details the implementation strategies and illustrates the implementation scheme for a particular function. Experimental results and discussion are then presented. Finally, the last section concludes the paper with some perspectives.

1. C64 internal architecture and main functions

2.1. Overview

Tuning the video codec software for DSP implementation involves several steps. Traditional development flows in the DSP industry include the following. Construct a C model for validation purposes.
As modern DSP compilers become more mature, they can do part of the laborious work of instruction selection, parallelizing, pipelining, and register allocation. However, we still often find that compilers make mistakes from time to time. In addition, to make the final code more compact and faster, the C code has to be tuned to match the DSP architecture. Figure 1 shows the typical three-step DSP code development flow [TEX 01]. When porting to a DSP, the data types must be considered first, since the definition of a type such as integer can differ between processors. For example, on the TI C6000, a long integer is 40 bits. Since the H.264/AVC codec deals with 8-bit pixels, the programmer can use the short data type for fixed-point multiplication, which takes only one cycle. Further optimizations should make maximal use of all the hardware resources in the critical loops.

2.2. C64 internal architecture

The TMS320C64x is a fixed-point DSP featuring the very long instruction word (VLIW) architecture developed by TI (VelociTI) [TEX 00]. This high-performance, advanced architecture makes these DSPs excellent choices for multimedia and multi-function applications such as an MPEG4 encoder. VelociTI, together with the development tool set and evaluation tools, provides faster development time and higher performance for embedded DSP applications through increased instruction-level parallelism. The C6416 processor consists of three main parts: the CPU (or core), peripherals, and memory. Eight functional units operate in parallel, organised as two similar sets of the basic four functional units. The units communicate using a cross path between the two register files. Program parallelism is defined at compile time, because no data dependency checking is done in hardware at run time. The 256-bit-wide program memory fetches eight 32-bit instructions every cycle, and the 64-bit-wide data bus is suitable for loading or storing 8 pixels at a time.
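Two of the points above, that integer widths differ between processors and that a 64-bit data path carries eight 8-bit pixels per access, can be illustrated in portable C. The sketch below is illustrative only: pack8 and pixel_at are hypothetical helper names, not TI intrinsics or functions from the paper, and the choice of placing pixel 0 in the least significant byte is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-width types avoid depending on how wide `long` is on a given
   target (the text notes it is 40 bits on the TI C6000). */

/* Pack eight consecutive 8-bit pixels into one 64-bit word, with pixel 0
   in the least significant byte (an assumption of this sketch). Shifts
   are used so the result does not depend on host endianness. */
static uint64_t pack8(const uint8_t *p)
{
    uint64_t w = 0;
    for (int k = 0; k < 8; k++)
        w |= (uint64_t)p[k] << (8 * k);
    return w;
}

/* Extract pixel k (0..7) from a word produced by pack8. */
static uint8_t pixel_at(uint64_t w, int k)
{
    assert(k >= 0 && k < 8);
    return (uint8_t)(w >> (8 * k));
}
```

On the DSP itself a single double-word load would bring the eight pixels into a register pair; the loop here only stands in for that load on a generic host.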
All of these features make the C6416 well suited to video processing. Figure 2 illustrates the internal architecture of the C64. Each functional unit has its own 32-bit write port into a general-purpose register file. All units ending in 1 (for example, .L1) write to register file A and all units ending in 2 write to register file B. Each functional unit has two 32-bit read ports for source operands src1 and src2. Four units (.L1, .L2, .S1, and .S2) have an extra 8-bit-wide port for 40-bit long writes as well as an 8-bit input for 40-bit long reads. Because each unit has its own 32-bit write port, all eight units can be used in parallel every cycle.

The most interesting instructions exploited in the implementation of the SAD engines are:
- LDDW: load double word (64 bits).
- LDNDW: load non-aligned double word.
- STDW: store double word.
- PACK: packing of 4 pixels.
- DOTPU4: dot product of 4 pairs of unsigned pixels.
- SUBABS: absolute differences of 4 pairs of pixels.
- AVGU4: average of 4 pairs of unsigned pixels.
For further information please refer to [TEX 02].

Figure 1. DSP code development flow. Phase 1, develop C code: write C code, compile. Phase 2, refine C code: refine C code, compile, more C optimisation. Phase 3, write linear assembly: write linear assembly, assembly optimize.

2. Functions description and optimization technique

The most commonly used metric to evaluate the match is the sum of absolute differences (SAD), which adds up the absolute differences between corresponding elements in the macroblocks. It is given by the following formula:

$$\mathrm{SAD}(x, y, r, s) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left| A(x+i,\, y+j) - B\big((x+r)+i,\, (y+s)+j\big) \right|, \qquad M, N \in \{4, 8, 16\}$$

where 0 < x, y < frame size, (r, s) is the motion vector, A(x, y) is a current frame pel at (x, y), B(x, y) is a reference frame pel at (x, y), and M and N are the block width and height.

H.264/AVC permits 7 different block sizes and up to quarter-pel precision with different modes of access (horizontal and vertical). For this reason, the Ublive software developed by UBvideo Inc. [UBV] uses 16 different functions to implement the SAD engine. Ublive is an encoder that is highly optimized algorithmically, which enables it to achieve objective and subjective performance levels close to those of the public JM encoder with significantly reduced time complexity. The 16 functions of the SAD engine are presented in Table 1. To exploit the hardware resources of the C64 and take advantage of its overall architecture, we describe each function in SA code. To that end, we adopt the following approaches:

- Examine the C code provided by UBvideo Inc. for each function and draw its data dependencies.
- Partition the data into 2 sectors, one processed by the A side of the CPU and the other by the B side.
- Load data from memory to the CPU 8 pixels at a time. This is possible only when the source and reference blocks are 16 or 8 pixels wide; otherwise, load 4 pixels per cycle. (H.264/AVC permits 7 different block sizes: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4.)
- Use the SUBABS instruction on the loaded source and reference pixels.
- Unroll the inner loop whenever possible; experimental results showed a major gain in speed in terms of cycle count.
- Permit no PSNR change during the optimisation process. In other words, our results must conform exactly to those of the C code.

3. Experimental results and discussion

Our experiments are carried out on a DSK 6416 running at 720 MHz. This board serves as a hardware reference design for the TMS320C6416 DSP. Figure 3 illustrates the general description of the sad16xnv2 function. The purpose of this function is to compute the SAD between four buffers (pointed to by A_cur and B_ref). These blocks are logically perceived as 2-dimensional 16x16 blocks, and each one is divided into four 8x8 blocks whose SADs are calculated separately and stored in the B_sad_matrix array.

Table 2 shows that we reach a reduction of more than 75% in cycle count compared to the original C code for most functions, and up to 85% for some. This is because we have exploited all the DSP resources and controlled the data transfer adequately.

Table 3 gives the experimental results for the SA and C code over the whole encoder. We obtain a 104% increase in encoding speed, in frames/sec, compared to the C code. We consider this an excellent result, since the Ublive code is supposed to be the most optimal code on the market.

4. Conclusion

In this paper, we have efficiently implemented the SAD engine on the C64 DSP in order to reduce the time consumed by the H.264/AVC motion estimation module. Since H.264/AVC offers 7 different block sizes, 16 SAD functions have been implemented in well-optimised SA code that benefits from the internal architecture of the C64, which is considered well suited to real-time multimedia applications. Up to 85% reduction in cycle count is obtained compared to the C code. As future work, we can optimise other time-consuming modules, such as the interpolation module and the intra-prediction module.

REFERENCES

[THO 02] Thomas W, "Study of Final Committee Draft of Joint Video Specification", ITU-T Rec. H.264 ISO/IEC 14496-10 AVC, Draft 1, December 2002.
[VAN 04] Vanghn Iverson, Jeff Mc Veigh, Bob Reese, "Real-Time H.264/AVC Codec On Intel Architectures", International Conference on Image Processing ICIP, pp. 757-760, 2004.
[HOR 03] Horowitz M., Joch A., Kossentini F., Hallapuro A., "H.264/AVC baseline profile decoder complexity analysis", IEEE Transactions on Circuits and Systems for Video Technology, Volume 13, Issue 7, July 2003.
[TEX 01] Texas Instruments, "TMS320C6000 Programmer Guide", 2001.
[TEX 00] Texas Instruments, "TMS320C6000 CPU and Instruction Set Reference Guide", SPRU189, 2000.
[TEX 02] Texas Instruments, "TMS320DM642 Video/Imaging Fixed-Point Digital Signal Processor", SPRS200A, 2002.
[UBV] UBvideo, www.ubvideo.com.

Figure 2. Internal C64 Architecture.

Figure 3. Data flow for the sad16xnv2 function on the C64 Platform. [Diagram: a loop executed n times loads 16 pixels from each of A_cur, B_cur, A_ref and B_ref; A_cur and B_cur are post-incremented by A_w, and A_ref and B_ref by B_w. The 8-bit absolute differences between current and reference pixels are accumulated into A_sad1 and B_sad1, and the sums are stored in B_sadArray.]

Table 1. Different SAD functions and their description.

| Function Name | Description |
|---|---|
| qp_sadmxnh2 | This function calculates the current and the right SAD values for 1/4 pixel mxn blocks. The difference is taken between the source block pixel and the average with rounding of the corresponding block pixels in the 2 reference buffers. Functions are available for m = 16, 8, and 4. |
| sadmxnv2 | This function calculates 2 mxn SADs for the bottom and the current macroblocks. Functions are available for m = 16, 8, and 4. |
| sparse_sad16x4 | This function calculates the 4 16x4 SADs of 7 horizontal positions for a macroblock. The SADs are stored into the sad16x4 array. |
| qp_sadmxnv2 | This function calculates the current and the bottom SAD values for 1/4 pixel mxn blocks. The difference is taken between the source block pixel and the average with rounding of the corresponding block pixels in the 2 reference buffers. Functions are available for m = 16, 8, and 4. |
| sadmxnh2 | This function calculates 2 mxn SADs for the right and the current macroblocks. Functions are available for m = 16, 8, and 4. |
| split_sad8x4 | This function calculates the 8 8x4 SADs of n horizontal positions for a macroblock. The SADs are stored into the sad8x4 array. |
| split_sad8x8 | This function calculates the 4 4x4 SADs of n horizontal positions for a macroblock. The SADs are stored into the sad4x4 array. |
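The descriptions in Table 1 can be illustrated with a plain C sketch. This is not the Ublive source or the SA code of the paper: the names sad_mxn, sad_mxn_h2 and qp_sad_mxn, their signatures, and the reading of "right" as a one-pixel horizontal offset in the reference are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Plain m x n SAD between a current block and a reference block.
   cur_stride and ref_stride are the row pitches in pixels. */
static unsigned sad_mxn(const uint8_t *cur, int cur_stride,
                        const uint8_t *ref, int ref_stride, int m, int n)
{
    unsigned sad = 0;
    assert(m > 0 && n > 0);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++)
            sad += (unsigned)abs(cur[i] - ref[i]);
        cur += cur_stride;
        ref += ref_stride;
    }
    return sad;
}

/* Two SADs in one pass: out[0] for the current position and out[1] for
   the position one pixel to the right (assumed meaning of "h2").
   The reference rows must hold at least m + 1 pixels. */
static void sad_mxn_h2(const uint8_t *cur, int cur_stride,
                       const uint8_t *ref, int ref_stride,
                       int m, int n, unsigned out[2])
{
    out[0] = out[1] = 0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            out[0] += (unsigned)abs(cur[i] - ref[i]);
            out[1] += (unsigned)abs(cur[i] - ref[i + 1]);
        }
        cur += cur_stride;
        ref += ref_stride;
    }
}

/* Quarter-pel style SAD: the source pixel is compared with the rounded
   average of the corresponding pixels in two reference buffers. */
static unsigned qp_sad_mxn(const uint8_t *cur, int cur_stride,
                           const uint8_t *ref0, const uint8_t *ref1,
                           int ref_stride, int m, int n)
{
    unsigned sad = 0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            int avg = (ref0[i] + ref1[i] + 1) >> 1; /* average with rounding */
            sad += (unsigned)abs(cur[i] - avg);
        }
        cur += cur_stride;
        ref0 += ref_stride;
        ref1 += ref_stride;
    }
    return sad;
}
```

Computing both SADs of sad_mxn_h2 in the same loop reads each current pixel once for two candidate positions, which is one plausible reason for grouping positions into a single function.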

Table 2. Cycle count and Optimisation gain percentage for each function.

| Function Name | C code (cycles) | SA code (cycles) | Opt-gain versus C code (%) |
|---|---|---|---|
| qp_sad16xnv2 | 1200 | 192 | 84 |
| qp_sad16xnh2 | 1144 | 168 | 85 |
| sad8xnv2 | 536 | 120 | 78 |
| sad8xnh2 | 480 | 112 | 76 |
| qp_sad4xnh2 | 280 | 104 | 62 |
| split_sad | 632 | 120 | 81 |
| qp_sad4xnv2 | 256 | 96 | 62 |
| qp_sad8xnv2 | 656 | 136 | 79 |
| sad4xnh2 | 208 | 88 | 57 |
| qp_sad8xnh2 | 648 | 152 | 76 |
| sad16xnv2 | 736 | 144 | 80 |
| sad4xnv2 | 208 | 88 | 57 |
| sparse_sad16x4 | 936 | 304 | 67 |
| sad16xnh2 | 760 | 160 | 79 |
| split_sad8x4 | 2544 | 392 | 85 |
| split_sad8x8 | 1000 | 200 | 80 |

Table 3. Experimental results for C and SA code over all encoder.

| Type | Encoding Speed (f/s) |
|---|---|
| C code | 29.71 |
| SA code | 60.64 |
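The gain column of Table 2 and the speed increase derived from Table 3 follow from simple ratios. The sketch below shows the arithmetic under the assumption that the gain is the relative cycle reduction (C - SA) / C; the function names gain_percent and speed_increase_percent are hypothetical and do not come from the paper.

```c
#include <assert.h>

/* Relative cycle reduction of the SA code versus the C code, in percent. */
static double gain_percent(double c_cycles, double sa_cycles)
{
    assert(c_cycles > 0.0);
    return 100.0 * (c_cycles - sa_cycles) / c_cycles;
}

/* Relative increase in encoding speed (frames/sec), in percent. */
static double speed_increase_percent(double c_fps, double sa_fps)
{
    assert(c_fps > 0.0);
    return 100.0 * (sa_fps - c_fps) / c_fps;
}
```

For qp_sad16xnv2, gain_percent(1200, 192) gives 84, as listed in Table 2. For the whole encoder, speed_increase_percent(29.71, 60.64) gives about 104.1, in line with the 104% increase quoted in the text. The other rows of Table 2 agree with this formula to within one percentage point.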