A Video CoDec Based on the TMS320C6X DSP

José Brito, EST IPCB / INESC, Av. do Empresário, Castelo Branco, Portugal, jbrito@est.ipcb.pt
Leonel Sousa, IST / INESC, Rua Alves Redol, Nº 9, 1000-029 Lisboa, Portugal, las@inesc.pt

Abstract

This paper presents a video Coder/Decoder (CoDec) developed for the TMS320C6x DSP, according to the H.263 Recommendation. This recommendation was selected because it achieves high compression ratios, but it requires fast computing platforms to compress pictures in real time. The main goal of this work is to meet real-time specifications with only one DSP. Another important objective is to use full-search block-matching (FSBM) in the motion estimation stage, to maximize the compression ratio. A new, efficient full-search block-matching algorithm for motion estimation on VLIW DSPs is proposed. We show in this paper that, by using the proposed algorithm, it is possible to implement a real-time video CoDec with FSBM on a single TMS320C6x.

I. Introduction

Recently, the demand for video transmission in multimedia systems has increased considerably [2]. However, the limited bandwidth of transmission channels demands the use of compression algorithms, which are computationally very intensive, namely in the motion estimation stage. The initial objective of this work was to develop an H.263 video CoDec [3] that codes QCIF pictures on a single DSP (TMS320C6x [1]), using full-search block-matching motion estimation, with one motion vector per macroblock within the range [-16; 15] and full pixel precision. Since motion estimation is usually the most computationally intensive stage of a video CoDec, a significant part of the optimization effort was focused on it. Several techniques were used to reduce its computational cost, namely a specially optimized block-matching algorithm, which applies a non-sequential order for block-matching in a given search area.
The DCT and DCT⁻¹ stages use routines already developed for C6x DSPs [4]. The VLC stage was inspired by the routines of a software CoDec from Telenor [5] (written in C), with some optimizations and changes. The presented coder was therefore completely developed in assembly language, except for some Variable Length Coding (VLC) routines. As QCIF pictures are considerably large, some data had to be placed in external memory; because of the longer access times, special care had to be taken in selecting which data to place there. The decoder was also inspired by the Telenor CoDec. As the computational weight of the decoder is far lower than that of the coder, some simple optimizations are enough to achieve about 26 decoded pictures per second, meeting the real-time goal. Tests were performed on benchmark video sequences; results are presented later in this paper.

The structure of an H.263 coder is presented in Figure 1.

[Figure 1 – Structure of an H.263 video coder, with DCT, quantization (Q), inverse quantization (Q⁻¹), inverse DCT (DCT⁻¹) and motion estimation stages.]

The different processing stages were developed and put together, as explained in the next sections.

II. DCT and DCT⁻¹

The DCT and DCT⁻¹ implementation is based on publicly available DCT and DCT⁻¹ routines [4]. Some changes were made to the original routines: each call to them processed only one 8x8 block of pixels, so an external loop was introduced to process a parameterized number of blocks, reducing the function-call overhead. This is relevant because we typically want to process not just one block but a number of adjacent blocks; in this coder, all the blocks in a GOB (a row of macroblocks) are processed at a time. The DCT processing of a single 8x8 block takes 226 cycles. The DCT⁻¹ processing of a

single 8x8 block takes 230 cycles. Table 2 presents results for the processing of a whole QCIF image; some additional cycles are needed for function-call overhead. Both routines process halfwords (16-bit values).

III. Quantization, Inverse Quantization and Zig-Zag Scanning

The DCT coefficients of each block must be quantized and scanned in a zig-zag order, as shown in Figure 2.

[Figure 2 – Zig-zag scanning of DCT coefficients.]

These operations are executed in a single assembly routine. To speed up the process, only power-of-two values are allowed as quantization steps. In this way, to quantize a given coefficient, we only have to shift its absolute value right by the base-2 logarithm of the quantization step. The processing of a coefficient corresponds to the following operations: read the coefficient, detect its sign, find the absolute value, shift it, and store it. To simplify the process, the quantized coefficient is stored with the sign in the most significant bit: if the coefficient is negative, its quantized absolute value is OR'ed with 0x8000. So, to quantize a coefficient we only apply the following instructions: 1 LDH, 1 EXTU, 1 ABS, 1 SHRU, 1 OR and 1 STH. As there are no dependencies, these 6 instructions can all be placed in a single execute packet, each one processing a different coefficient, so one coefficient finishes its processing in each cycle. Table 2 presents results for a whole picture; some additional cycles are needed for function-call overhead. The zig-zag scanning is done by reading the coefficients in the order shown in Figure 2, which is not a regular order. Because of this, it was necessary to completely unroll the loop of the routine, as the load instruction always uses a different address.
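As an illustration, the quantization data path described above, together with the inverse quantization of equations (1) and (2) given below, can be sketched in C. This is a minimal sketch with hypothetical function names; the real routines are hand-written C6x assembly, and `quantize` assumes the power-of-two restriction on the step.

```c
#include <stdint.h>
#include <stdlib.h>

/* Quantize one DCT coefficient.  shift = log2(quantization step);
 * only power-of-two steps are allowed, so the division becomes a SHRU.
 * The sign is kept in the most significant bit (OR with 0x8000),
 * mirroring the LDH/EXTU/ABS/SHRU/OR/STH sequence described above. */
static uint16_t quantize(int16_t coeff, unsigned shift)
{
    uint16_t level = (uint16_t)(abs(coeff) >> shift);
    return (coeff < 0) ? (uint16_t)(level | 0x8000u) : level;
}

/* Inverse quantization, following equations (1) and (2):
 *   |REC| = QUANT*(2*|LEVEL| + 1)       if QUANT is odd
 *   |REC| = QUANT*(2*|LEVEL| + 1) - 1   if QUANT is even
 * A zero level reconstructs to zero; the sign is restored last. */
static int16_t dequantize(uint16_t qcoeff, int quant)
{
    int level = qcoeff & 0x7FFF;          /* magnitude */
    int neg   = (qcoeff & 0x8000u) != 0;  /* sign bit  */
    if (level == 0)
        return 0;
    int rec = quant * (2 * level + 1);
    if ((quant & 1) == 0)                 /* even quantization step */
        rec -= 1;
    return (int16_t)(neg ? -rec : rec);
}
```

For example, with a step of 8 (shift = 3), a coefficient of -100 quantizes to level 12 with the sign bit set, and reconstructs to -199. Unlike the assembly version, this sketch does not process one coefficient per cycle; it only documents the data path.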
The inverse quantization routine implements the following equations:

|REC| = QUANT × (2·|LEVEL| + 1)        if QUANT is odd   (1)
|REC| = QUANT × (2·|LEVEL| + 1) − 1    if QUANT is even  (2)

To inverse quantize a coefficient we perform the following operations: read the level; detect its sign; multiply the absolute value by 2; add 1 (if the level is not zero); multiply it by the quantization value; subtract 1 from the result (if the quantization step is even); store the reconstruction value. To store the reconstruction value in two's complement, one additional instruction is needed: multiplying the reconstruction level by −1 if LEVEL is negative. This processing can be done with the following instructions: 1 LDH, 1 AND, 1 EXTU, 1 ADD, 1 MPY, 1 ADD, 1 MPY and 1 STH. These instructions cannot be fitted into one execute packet, because some of them are conditional: for example, the final MPY is executed only if LEVEL is negative. Because of live-too-long issues, the target register cannot be the operand register, so if the condition is false a MV instruction must be used to pass the data to the next instruction (STH). Two execute packets are needed, which means one coefficient finishes processing every 2 cycles. Table 2 presents results for a whole picture; some additional cycles are needed for function-call overhead. The inverse quantization routine can be used with any quantization step. As in the quantization routine, the data is read in a non-sequential order, so complete loop unrolling was also necessary.

IV. Variable Length Coding

Some of the VLC routines are inspired by the routines of an H.263 CoDec implemented in the C language by Telenor. VLC code mainly analyzes data or control variables and replaces them with new values found in a table. It therefore contains a lot of conditional processing, which makes it very hard to exploit pipelining. To overcome this difficulty, several different routines were developed to process data in different conditions.
In this way, condition evaluation is transferred outside the routines and the processing needs fewer instructions. For example, different routines are used for INTRA and INTER modes. Nevertheless, a lot of conditional processing remains, and programming these routines in assembly would not significantly improve the performance of the coder. This is why the VLC routines are in C.

V. Motion Estimation

One of the initial objectives was to use full-search block-matching, therefore testing every possible candidate motion vector. It was decided to have motion vectors within the range [-16; 15], to have full pixel precision and

to have one motion vector per macroblock. In these conditions, 32×32 = 1024 block-matches must be executed for each macroblock (except for macroblocks at the border of the picture); to process each picture, 82497 block-matches must be executed. An optimised block-matching routine was developed. To execute a block-match, one of two criteria may be used: the Mean Absolute Difference (MAD), given by (3), or the Mean Square Error (MSE), given by (4), where a(x,y) are pixels in the current picture and b(x,y) are pixels in the reference picture:

MAD = Σ |a(x,y) − b(x,y)|     (3)
MSE = Σ (a(x,y) − b(x,y))²    (4)

It was decided to use MSE, for reasons that will soon become clear. With MSE, the sequence of operations for each pixel is: read a pixel from the current picture, read a pixel from the reference picture, subtract the values, multiply the result by itself and accumulate. In terms of instructions, this means 2 LDB, 1 SUB, 1 MPY and 1 ADD. These 5 instructions can be fitted into 1 execute packet, and 1 pair of pixels is processed in each cycle (pixels are 8-bit values). If the pixels were stored in memory as 16-bit values, a single LDW instruction would read a pair of pixels from a picture, placing one in the 16 high-order bits and the other in the 16 low-order bits of the target register. It is then possible to use the SUB2 instruction to simultaneously subtract 2 pairs of pixels. The multiplications are executed by an MPY and an MPYH, and two accumulations are now needed. We now have: 2 LDW, 1 SUB2, 1 MPY, 1 MPYH, 2 ADD. These 7 instructions can also be fitted into one single execute packet, leaving one instruction slot free for loop control. In this situation, 2 pairs of pixels are processed in each cycle. To use MAD with the same performance, we would need an instruction like ABS2, which does not exist. With the procedure proposed above, two problems arise: it is impossible to execute block-matches that begin in even pixels, and the memory space required to store the pictures doubles.
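The full-search loop with the MSE criterion can be sketched in plain C. This is a scalar reference sketch with hypothetical names; on the C6x, the 16-bit-widened pixels are loaded in pairs with LDW and processed with SUB2/MPY/MPYH, two pairs per cycle.

```c
#include <stdint.h>

#define MB 16  /* macroblock dimension */

/* MSE between a 16x16 macroblock of the current picture and one
 * candidate block of the reference picture.  Pixels are assumed to
 * have been widened to 16 bits, as described above. */
static uint32_t block_mse(const int16_t *cur, const int16_t *ref, int stride)
{
    uint32_t acc = 0;
    for (int y = 0; y < MB; y++)
        for (int x = 0; x < MB; x++) {
            int d = cur[y * stride + x] - ref[y * stride + x];
            acc += (uint32_t)(d * d);          /* MPY + ADD */
        }
    return acc;
}

/* Full search: test every candidate vector in [-16, 15] x [-16, 15].
 * `ref` points at the position co-located with `cur`, and the caller
 * must guarantee that the whole search area is valid memory. */
static uint32_t full_search(const int16_t *cur, const int16_t *ref,
                            int stride, int *best_dx, int *best_dy)
{
    uint32_t best = UINT32_MAX;
    for (int dy = -16; dy <= 15; dy++)
        for (int dx = -16; dx <= 15; dx++) {
            uint32_t e = block_mse(cur, ref + dy * stride + dx, stride);
            if (e < best) { best = e; *best_dx = dx; *best_dy = dy; }
        }
    return best;
}
```

This is the non-aborting version: it always visits all 16×16 pixel pairs of all 1024 candidates.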
To solve these problems, the following strategy was developed. Pixels are stored as 8-bit values and motion estimation is performed one GOB at a time. For each GOB, only the necessary pixels are transformed into 16-bit values and placed in separate auxiliary memory sections. The necessary pixels are the ones from all the macroblocks of the GOB in the current picture, plus the ones from all the macroblocks in the surrounding GOBs of the reference picture, so two auxiliary memory sections are needed. Using this strategy saves a lot of data memory. First, we execute all the block-matches beginning in odd pixels, for every macroblock of the GOB. Then the pixels are transformed into 16-bit values once more, but beginning at the second pixel of the GOB, therefore creating a new alignment of the pixels in memory, to perform the block-matches that begin in even pixels. Special routines were developed to transform pixels from 8-bit to 16-bit values. The last optimisation was crucial to the performance of the coder. During the execution of a given block-match, the partial error value is compared with the minimum error already calculated for the related macroblock. If the partial error is already bigger than the minimum error, the block-match can be aborted, avoiding useless processing. The additional instructions required (1 ADD to add the two accumulation registers, 1 CMPGTU to compare the values and 1 B to jump out of the loop) can be inserted into the existing execute packets, with no need for additional ones. This optimisation, combined with the fact that we first execute all the odd block-matches and then all the even block-matches, leads to an average saving of about 60% of the operations. There is a detail in memory access when a TMS320C6201 DSP is used: there is only one memory block in internal data memory, and two memory accesses are executed on every cycle.
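The abort test just described can be added to the matching loop. The following is a scalar sketch with hypothetical names, checking the partial error once per macroblock row to keep it short, whereas the assembly version folds the test into the existing execute packets.

```c
#include <stdint.h>

/* Block-match with early termination: the partial MSE is compared
 * with the best (minimum) error already found for this macroblock,
 * and the candidate is abandoned as soon as it cannot win.  On the
 * C6x the extra ADD/CMPGTU/B fit into the existing execute packets,
 * so the test costs no additional cycles. */
static uint32_t block_mse_abort(const int16_t *cur, const int16_t *ref,
                                int stride, uint32_t best_so_far)
{
    uint32_t acc = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++) {
            int d = cur[y * stride + x] - ref[y * stride + x];
            acc += (uint32_t)(d * d);
        }
        if (acc > best_so_far)   /* partial error already too big... */
            return acc;          /* ...abort the rest of this match  */
    }
    return acc;
}
```

Combined with visiting the odd-pixel candidates first, this abort accounts for the roughly 60% average saving reported above (compare the abort and no-abort columns of Tables 3 and 4).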
The difference between a given block-match and the next is that the pixels read from the current picture are the same, but the pixels read from the reference picture are the adjacent pair of pixels. This means that memory stalls will occur on every cycle of every other block-match. If a TMS320C6202 were used, placing the pixels of the different pictures in different memory blocks would solve this problem.

VI. The CoDec

Having developed the processing stages described above, two issues remain to be solved: first, how they work together, and second, how data memory is managed. If the picture is to be coded in INTRA mode, each picture is processed one GOB at a time. The pixels of each GOB are simply transformed into 16-bit values, and the DCT is performed for all macroblocks, as are the quantization, VLC, inverse quantization, DCT⁻¹ and the transformation of the pixels back to 8-bit values. Partial results are placed in the same auxiliary memory sections that were used in motion estimation, since these sections are no

longer in use. Final results are placed in the section where the reference picture is stored, because the reconstructed picture will become the next reference picture. If the picture is to be coded in INTER mode, the motion estimation processing is executed for all the macroblocks in the picture. After motion estimation is finished, the picture is processed one GOB at a time. The pixels of each macroblock of the current picture are subtracted from the pixels of the corresponding macroblock of the reference picture, considering the corresponding motion vector. This creates difference macroblocks, which are placed in the auxiliary data memory sections as 16-bit values. DCT, quantization, VLC, inverse quantization and inverse DCT are performed. Then the difference macroblocks are added to the macroblocks of the reference picture, considering once again the corresponding motion vector, to reconstruct the original picture. This is called motion compensation. The reconstructed picture is placed in the data memory section where the reference picture is stored, as the reconstructed picture will become the next reference picture. Some routines were developed to perform the pixel additions and subtractions. If a TMS320C6201 is to be used, not all of the data can be placed in the 65536 bytes of internal data memory. Two QCIF pictures take 76032 bytes (with pixels as 8-bit values), so one of them has to be stored in external memory. Besides the pictures, other data must be stored, such as tables, motion vectors, the system stack, the output stream and the auxiliary memory sections. The auxiliary sections alone are equivalent to 4 GOBs (1 GOB from the current picture and 3 GOBs from the reference picture) with pixels as 16-bit values (22528 bytes). They are accessed very intensively, and it is important that they stay in internal memory. One picture had to be placed in external memory.
It was decided to place the current picture there, because it is accessed less intensively than the reference picture. The output stream was also placed in external memory, because the space it uses is variable and a big memory space must be allocated, although most of it will rarely be used. All the remaining data could be placed in internal data memory.

VII. Experimental results

Benchmark video sequences, such as Carphone, Claire, Silent and Suzie, were used to test the coder's performance. Table 1 shows the number of coded frames per second for different DSP versions and clock frequencies.

          C6201        C6202
166 MHz   15 frames/s  -
200 MHz   18 frames/s  28 frames/s
250 MHz   -            36 frames/s

Table 1 – Average number of coded pictures per second for different DSPs and frequencies

Table 2, Table 3 and Table 4 show the approximate number of cycles per frame for each stage.

DCT/DCT⁻¹       Quant./I.quant.   VLC
134244/136620   38016/76032       1.4 million

Table 2 – Approximate number of cycles for processing a QCIF picture

Pixel Transf.   Motion Estimation (abort)   Motion Estimation (no abort)
1 million       7 million                   16.7 million

Table 3 – Approximate number of cycles for processing a QCIF picture (C6201)

Pixel Transf.   Motion Estimation (abort)   Motion Estimation (no abort)
80000           4.8 million                 11.4 million

Table 4 – Approximate number of cycles for processing a QCIF picture (C6202)

The pixel transformation includes accesses to external memory; the access times should be lower in a real C6201 DSP. We can see that the real-time objective was achieved. The average percentages of cycles for the different processing stages are shown in Figure 3 and Figure 4.

[Figure 3 – Percentage of cycles for the different processing stages on the C6201: motion estimation 63.7%, pixel transformation 10.2%, motion compensation 1.8%, DCT/IDCT/quant./i.quant. 4.0%, VLC 13.0%, other 7.3%.]

An obvious conclusion is that motion estimation is the most intensive processing

stage, as was initially foreseen. Its performance almost completely defines the performance of the whole system, hence the importance given to it in the development and optimization of the coder.

[Figure 4 – Percentage of cycles for the different processing stages on the C6202: motion estimation 69.3%, pixel transformation 1.1%, motion compensation 2.9%, DCT/IDCT/quant./i.quant. 6.3%, VLC 20.1%, other 0.3%.]

The differences in performance between the two processors are due to internal memory stalls and accesses to external memory. For the C6202, all the data is in internal memory, so no access to external memory is necessary; also, the two auxiliary memory sections are placed in two different memory blocks, so no internal memory stalls occur. Because of these two reasons, the processing stages that improve the most are motion estimation (no stalls occur), pixel transformation (no external memory is accessed) and "other" (which consists mainly of copying pixels between memory sections). As a result, all the other processing stages increase their relative computational weight in the system, although their number of cycles stays the same. Despite the improvement in the motion estimation stage, its relative computational weight slightly increases, because the improvement in the other stages is more effective. The simulated access times for external memory are long, which is why the results for the C6202 are so much better; real systems with a C6201 are expected to show better results. The decoder was also inspired by the Telenor CoDec, with some optimisations, mainly the use of Texas Instruments' DCT and DCT⁻¹ routines and the placement of the most intensively accessed data in internal memory. These and other small changes to the code were enough to make the decoder work in real time; average decoding rates were about 26 pictures/s.

VIII.
Conclusions

This paper shows that it is possible to implement, on a single TMS320C6x DSP, a video CoDec capable of real-time coding/decoding using FSBM, which allows greater compression ratios for a given SNR. This was accomplished through the use of specially optimized code developed in assembly, and a new strategy for full-search block-matching that resulted in a 60% improvement in the DSP performance for motion estimation.

IX. Future plans

Future plans include the implementation of the CoDec on a C6201 EVM, to test its performance in a real system with realistic access times to external memory. Refining the decoder, so that both coder and decoder can co-exist in a single C6x, is being considered, as is the development of a CoDec for CIF pictures with a sub-optimal motion estimation method, such as OTS. Another improvement would be the re-writing of the quantization routines, so that they can accept any quantization step.

X. Acknowledgements

The authors thank Texas Instruments for offering the C6x development platform under the TI University Program.

XI. References

[1] TMS320C6000 CPU and Instruction Set Reference Guide, Texas Instruments, March 1999.
[2] Vasudev Bhaskaran, Konstantinos Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, Kluwer Academic Publishers, 2nd edition, 1997.
[3] Draft ITU-T Recommendation H.263, Video Coding for Low Bitrate Communication, May 1997.
[4] TMS320C62x Assembly Benchmarks, Texas Instruments, http://www.ti.com/sc/docs/products/dsp/c6000/62bench.htm#graphics
[5] Telenor Research and Development, http://www.fou.telenor.no/brukere/dvc/
[6] O. P. Sohm, D. R. Bull, C. N. Canagarajah, "Fast 2D-DCT Implementations for VLIW Processors", IEEE Third Workshop on Multimedia Signal Processing, Copenhagen, September 1999.

XII. Authors' Profile

José Brito is a graduate in Electrical Engineering and Computer Science.
He is a researcher at INESC and teaches Electrical Engineering at the Instituto Politécnico de Castelo Branco. Leonel Sousa is a professor in the Department of Electrical Engineering of Instituto Superior Técnico, Lisbon, and a senior researcher at INESC.