SAD implementation and optimization for H.264/AVC encoder on TMS320C64 DSP

SETIT 2007, 4th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications, March 25-29, 2007, TUNISIA

M. A. BEN AYED, A. SAMET, N. MASMOUDI
Electronics and Information Technology Laboratory, University of Sfax, National School of Engineering. BP W 3038 Sfax, TUNISIA
Mohamedali.benayed@isecs.rnu.tn uri.masmoudi@enis.rnu.tn

Abstract: Motion estimation in video coding standards such as H.264/AVC is considered to be the most time-consuming encoding module. Motion estimation is generally performed on a 16x16 block, although H.264/AVC allows 7 different block sizes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4). The aim of this paper is to optimise the implementation of the motion estimation algorithm on the Texas Instruments TMS320C64 DSP. Specifically, the goal is to use the C64 instruction set to optimise the Sum of Absolute Differences (SAD) engine within the motion estimation, and to take advantage of Direct Memory Access (DMA) to reduce the cycle cost of loading data from external to internal memory. Standard Assembly (SA) is used to implement the different SAD functions in order to exploit the C64 internal architecture and resources efficiently. Experimental results show more than 75% improvement in cycle cost compared to the C code for most functions.

Key words: H.264/AVC, Motion estimation, SAD, TMS320C64, SA.

INTRODUCTION

The recently standardized H.264/MPEG-4 AVC video coder [THO 02] (formerly known as ITU-T H.26L) is the result of the work carried out by the Joint Video Team (JVT), formed by the International Telecommunication Union (ITU-T VCEG, Video Coding Experts Group) and the International Organization for Standardization (ISO/IEC MPEG, Moving Picture Experts Group).
This upcoming standard is expected to play an important role in the broadcasting market, since it brings advances in digital video implementations in terms of bit rate reduction, transmission resiliency and video quality. However, the impressive coding efficiency of H.264/AVC comes at the expense of significantly increased algorithmic complexity compared to existing standards, which has limited the availability of cost-effective, high-performance solutions [VAN 04]. In fact, most existing real-time H.264/AVC encoders are implemented on a DSP platform because of its software flexibility for upgrades, relatively low software development cost, and reduced time to market. We implement our SAD engine on the C64 DSP from TI, since its architecture is particularly well suited to multimedia applications.

According to recent work [HOR 03], the SAD engine consumes most of the encoding time, whether in motion estimation for inter coding or in the intra-coding module. Since H.264/AVC allows 7 different block sizes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4), 16 different functions have to be implemented using a standard assembly (SA) description. Each function has its own characteristics and properties in terms of data transfer and computational dependency. We compare our results to the C code.

The paper is structured as follows: the next section describes the internal architecture of our platform, the TI C64. Section three details the implementation strategies and illustrates the implementation scheme for a particular function. Experimental results and discussion are presented in section four. Finally, section five concludes the paper with some perspectives.

1. C64 internal architecture and main functions

2.1. Overview

Tuning video codec software for a DSP implementation involves several steps. The traditional development flow in the DSP industry starts by constructing a C model for validation purposes.
As modern DSP compilers become more mature, they can do part of the laborious work of instruction selection, parallelization, pipelining, and register allocation. However, we still often find that compilers make mistakes from time to time. In addition, to make the final code more compact and faster, the C code has to be tuned to match the DSP architecture. Figure 1 shows the typical three-step DSP code development flow [TEX 01].

When porting to a DSP, the data types must be considered first, since the definition of a type such as integer can differ between processors. For example, on the TI C6000, the long integer is 40 bits wide. Since the H.264/AVC codec deals with 8-bit pixels, the programmer can use the short data type for fixed-point multiplication, which takes only one cycle. Further optimizations should make maximal use of all the hardware resources in the critical loops.

2.2. C64 internal architecture

The TMS320C64x is a fixed-point DSP featuring the very long instruction word (VLIW) architecture developed by TI (VelociTI) [TEX 00]. This high-performance, advanced architecture makes these DSPs excellent choices for multimedia and multi-function applications such as an MPEG-4 encoder. VelociTI, together with the development tool set and evaluation tools, provides faster development time and higher performance for embedded DSP applications through increased instruction-level parallelism.

The C6416 processor consists of three main parts: CPU (or core), peripherals, and memory. Eight functional units operate in parallel, organized as two similar sets of four basic functional units. The units communicate using a cross path between the two register files. Program parallelism is defined at compile time, because no data dependency checking is done in hardware at run time. The 256-bit-wide program memory bus fetches eight 32-bit instructions every cycle, and the 64-bit-wide data bus is suitable for loading or storing 8 pixels at a time.
All of these features make the C6416 particularly well suited to video processing. Figure 2 illustrates the internal architecture of the C64.

Each functional unit has its own 32-bit write port into a general-purpose register file. All units ending in 1 (for example, .L1) write to register file A and all units ending in 2 write to register file B. Each functional unit has two 32-bit read ports for source operands src1 and src2. Four units (.L1, .L2, .S1, and .S2) have an extra 8-bit-wide port for 40-bit long writes as well as an 8-bit input for 40-bit long reads. Because each unit has its own 32-bit write port, all eight units can be used in parallel every cycle.

The most interesting instructions exploited in the implementation of the SAD engine are:
- LDDW: load double word (64 bits).
- LDNDW: load non-aligned double word.
- STDW: store double word.
- PACK: packing of 4 pixels.
- DOTPU4: dot product of two sets of 4 unsigned 8-bit pixels.
- SUBABS: absolute differences of 4 pairs of 8-bit pixels.
- AVGU4: average of 4 pairs of unsigned 8-bit pixels.
For further information please refer to [TEX 02].

Figure 1. DSP code development flow. Phase 1: develop C code (write C code, compile). Phase 2: refine C code (refine C code, compile, more C optimisation). Phase 3: write linear assembly (write linear assembly, assembly optimize).

2. Functions description and optimization technique

The most commonly used metric to evaluate the match is the SAD, which adds up the absolute differences between corresponding elements of the macroblocks. For a block of M x N pixels it is given by the following formula:

\[
\mathrm{SAD}(x, y, r, s) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left| A(x+i,\, y+j) - B\big((x+r)+i,\, (y+s)+j\big) \right|, \qquad M, N \in \{4, 8, 16\}
\]

where 0 < x, y < frame size, (r, s) is the motion vector, A(x, y) is the current frame pel at (x, y), and B(x, y) is the reference frame pel at (x, y).

Since H.264/AVC permits 7 different block sizes and up to quarter-pel precision for different modes of access (horizontal and vertical), the Ublive software developed by UBvideo Inc. [UBV] uses 16 different functions to implement the SAD engine. Ublive is an algorithmically highly optimized encoder, which enables it to achieve objective and subjective performance levels close to those of the public JM encoder with significantly reduced time complexity. The 16 functions of the SAD engine are presented in Table 1.

In order to exploit the hardware resources of the C64 and to take advantage of its overall architecture, each function has to be described in SA code. To that end, we adopt the following approach:

- Examine the C code provided by UBvideo Inc. for each function and draw its data dependencies.
- Partition the data into 2 sectors: one is processed by the A side of the CPU, the other by the B side.
- Load data from memory to the CPU 8 pixels at a time. This is possible only when the source and reference blocks are 16 or 8 pixels wide; otherwise, load 4 pixels at a time (H.264/AVC permits 7 different block sizes: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4).
- Use the SUBABS instruction on the loaded source and reference pixels.
- Unroll the inner loop whenever possible; experimental results showed a major speed gain in terms of cycle count.
- No PSNR change is permitted during the optimisation process. In other words, our results have to match those of the C code exactly.

3. Experimental results and discussion

Our experiments are carried out on a DSK 6416 running at 720 MHz. This board serves as a hardware reference design for the TMS320C6416 DSP. Figure 3 gives the general description of the sad16xnv2 function. The purpose of this function is to compute the SAD between four buffers (pointed to by A_cur and B_ref). These blocks are logically treated as two-dimensional 16x16 blocks, each divided into four 8x8 blocks whose SADs are calculated separately and stored in the B_sad_matrix array.

Table 2 shows that we have reached more than 75% reduction in cycle count compared to the original C code for most functions, and up to 85% for some of them; the 4-pixel-wide functions and sparse_sad16x4 gain less (57% to 67%). This is because we have exploited all DSP resources and controlled the data transfers adequately. Table 3 gives the experimental results for the SA and C code for the overall encoder. We obtain a 104% increase in encoding speed, in frames per second, compared to the C code. This is an excellent result, since the Ublive code is reputed to be the most optimized code on the market.

4. Conclusion

In this paper, we have implemented the SAD engine efficiently on the C64 DSP in order to reduce the time consumed by the H.264/AVC motion estimation module. Since H.264/AVC offers 7 different block sizes, 16 SAD functions have been implemented in well-optimised SA code that lets us benefit from the internal architecture of the C64, which is considered to be highly suitable for real-time multimedia applications. Up to 85% reduction in cycle count (more than 75% for most functions) is obtained compared to the C code. As perspectives, we can optimise other time-consuming modules such as the interpolation module and the intra-prediction module.

REFERENCES

[THO 02] Thomas W, "Study of Final Committee Draft of Joint Video Specification", ITU-T Rec. H.264 ISO/IEC 14496-10 AVC, Draft 1, December 2002.
[VAN 04] Vanghn Iverson, Jeff Mc Veigh, Bob Reese, "Real-Time H.264/AVC Codec On Intel Architectures", International Conference on Image Processing ICIP, pp. 757-760, 2004.
[HOR 03] Horowitz M., Joch A., Kossentini F., Hallapuro A., "H.264/AVC baseline profile decoder complexity analysis", Circuits and Systems for Video Technology, IEEE Transactions, Volume 13, Issue 7, July 2003.
[TEX 01] Texas Instruments, "TMS320C6000 Programmer Guide", 2001.
[TEX 00] Texas Instruments, "TMS320C6000 CPU and Instruction Set Reference Guide", SPRU189, 2000.
[TEX 02] Texas Instruments, "TMS320DM642 Video/Imaging Fixed-Point Digital Signal Processor", SPRS200A, 2002.
[UBV] UBvideo, www.ubvideo.com.

Figure 2. Internal C64 Architecture.

Figure 3. Data flow for the sad16xnv2 function on the C64 Platform. (The loop is executed n times. In each iteration, 16 current pixels are loaded through A_cur and B_cur, which are post-incremented by A_w, and 16 reference pixels are loaded through A_ref and B_ref, which are post-incremented by B_w. The 8-bit absolute differences between current and reference pixels are accumulated into A_sad1 and B_sad1. After the loop, A_sad1 + B_sad1 is stored into B_sadArray.)

Table 1. Different SAD functions and their description.

Function name: qp_sadmxnh2
Description: This function calculates the current and the right SAD values for 1/4 pixel mxn blocks. The difference is taken between the source block pixel and the average with rounding of the corresponding block pixels in the 2 reference buffers. Functions are available for m = 16, 8, and 4.

Function name: sadmxnv2
Description: This function calculates 2 mxn SADs for the bottom and the current macroblocks. Functions are available for m = 16, 8, and 4.

Function name: sparse_sad16x4
Description: This function calculates the 4 16x4 SADs of 7 horizontal positions for a macroblock. The SADs are stored into the sad16x4 array.

Function name: qp_sadmxnv2
Description: This function calculates the current and the bottom SAD values for 1/4 pixel mxn blocks. The difference is taken between the source block pixel and the average with rounding of the corresponding block pixels in the 2 reference buffers. Functions are available for m = 16, 8, and 4.

Function name: sadmxnh2
Description: This function calculates 2 mxn SADs for the right and the current macroblocks. Functions are available for m = 16, 8, and 4.

Function name: split_sad8x4
Description: This function calculates the 8 8x4 SADs of n horizontal positions for a macroblock. The SADs are stored into the sad8x4 array.

Function name: split_sad8x8
Description: This function calculates the 4 4x4 SADs of n horizontal positions for a macroblock. The SADs are stored into the sad4x4 array.
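As a rough illustration of the operations described in Table 1, the following C sketch computes a plain m x n SAD and a quarter-pel variant in which the source block is compared with the average with rounding of two reference buffers. It is not the Ublive code: the function names sad_mxn and qp_sad_mxn, the stride parameters, and the single SAD returned per call are assumptions made for this example, whereas the functions in Table 1 produce two or more SADs per call.

```c
#include <stdint.h>
#include <stdlib.h>

/* Plain SAD between an m-wide, n-high source block and a reference block.
 * cur_stride and ref_stride are row pitches in bytes (hypothetical parameters). */
static unsigned sad_mxn(const uint8_t *cur, int cur_stride,
                        const uint8_t *ref, int ref_stride,
                        int m, int n)
{
    unsigned sad = 0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++)
            sad += (unsigned)abs((int)cur[i] - (int)ref[i]);
        cur += cur_stride;   /* next source row */
        ref += ref_stride;   /* next reference row */
    }
    return sad;
}

/* Quarter-pel variant (assumed form): the reference pixel is the average
 * with rounding of the corresponding pixels in two reference buffers,
 * which is the per-byte operation that AVGU4 applies to 4 pixels at once. */
static unsigned qp_sad_mxn(const uint8_t *cur, int cur_stride,
                           const uint8_t *ref0, const uint8_t *ref1,
                           int ref_stride, int m, int n)
{
    unsigned sad = 0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            int avg = ((int)ref0[i] + (int)ref1[i] + 1) >> 1;
            sad += (unsigned)abs((int)cur[i] - avg);
        }
        cur  += cur_stride;
        ref0 += ref_stride;
        ref1 += ref_stride;
    }
    return sad;
}
```

A scalar reference of this kind is mainly useful for checking that an SA implementation returns exactly the same SAD values, in line with the requirement that the optimisation must not change the PSNR.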

Table 2. Cycle count and optimisation gain percentage for each function.

Function name     C code (cycles)   SA code (cycles)   Opt-gain versus C code (%)
qp_sad16xnv2      1200              192                84
qp_sad16xnh2      1144              168                85
sad8xnv2          536               120                78
sad8xnh2          480               112                76
qp_sad4xnh2       280               104                62
split_sad         632               120                81
qp_sad4xnv2       256               96                 62
qp_sad8xnv2       656               136                79
sad4xnh2          208               88                 57
qp_sad8xnh2       648               152                76
sad16xnv2         736               144                80
sad4xnv2          208               88                 57
sparse_sad16x4    936               304                67
sad16xnh2         760               160                79
split_sad8x4      2544              392                85
split_sad8x8      1000              200                80

Table 3. Experimental results for C and SA code for the overall encoder.

Type       Encoding speed (f/s)
C code     29.71
SA code    60.64
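The gain column of Table 2 and the speed increase derived from Table 3 follow from simple ratios. The sketch below recomputes them; the helper names cycle_gain_percent and speed_increase_percent are hypothetical and chosen only for this example. The published gains agree with the values computed this way to within one percentage point.

```c
/* Cycle-count reduction of the SA code relative to the C code, in percent:
 * 100 * (C - SA) / C. */
static double cycle_gain_percent(unsigned c_cycles, unsigned sa_cycles)
{
    return 100.0 * ((double)c_cycles - (double)sa_cycles) / (double)c_cycles;
}

/* Increase in encoding speed of the SA code relative to the C code,
 * in percent: 100 * (SA - C) / C, with both speeds in frames per second. */
static double speed_increase_percent(double c_fps, double sa_fps)
{
    return 100.0 * (sa_fps - c_fps) / c_fps;
}
```

For example, cycle_gain_percent(1200, 192) evaluates to 84, the value listed for qp_sad16xnv2, and speed_increase_percent(29.71, 60.64) is slightly above 104, the figure quoted for the overall encoder.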