A Dynamic Quality-Adjustable H.264 Video Encoder for Power-Aware Video Applications


TCSVT

Hsiu-Cheng Chang, Jia-Wei Chen, Bing-Tsung Wu, Ching-Lung Su, Jinn-Shyan Wang, and Jiun-In Guo

Abstract: This paper proposes a dynamic quality-adjustable H.264 Baseline Profile (BP) video encoder that comprises 470Kgates and 13.3Kbytes SRAM in a core size of 4.3x4.3mm² using TSMC 0.13µm 1P8M CMOS technology. Exploiting parameterized algorithms for motion estimation and intra prediction, the proposed design can dynamically configure the encoding modes to trade off power consumption against video quality for various video encoding applications. In addition, the proposed Basic Unit (BU) based rate control hardware can maintain a constant and stable bit-rate for network video transmission. It achieves real-time H.264 video encoding on CIF, D1, and HD720@30fps with 7mW-to-25mW, 27mW-to-162mW, and 122mW-to-183mW power dissipation in different quality modes.

Index Terms: Quality-adjustable, H.264, baseline profile, video encoder, HD720

I. INTRODUCTION

ISO/IEC Moving Picture Experts Group (MPEG) and ITU-T Video Coding Experts Group (VCEG) jointly developed the video standard H.264/AVC [1] for next-generation multimedia coding applications. The H.264 video encoder system is composed of various efficient coding techniques, including variable block size motion estimation and motion compensation with precision up to quarter-pixel prediction, various block size (16x16/4x4) intra prediction, in-loop de-blocking filtering, and context adaptive entropy coding. These techniques exhibit high coding efficiency by providing more accurate estimation results at the cost of much higher computational complexity [2]. As a result, the computational complexity of H.264 video coding is much higher than that of the previous MPEG standards, which makes dedicated hardware designs necessary for real-time H.264 video coding.
Manuscript received December 15, 2008; revised March 12, 2009, and May 22. This work was supported by National Science Council of Taiwan under Grant NSC E. This paper was recommended by Associate Editor Justin Ridge. H.-C. Chang, B.-T. Wu, and J.-I. Guo are with the Department of Computer Science and Information Engineering, National Chung-Cheng University, Chia-Yi, 621 Taiwan, R.O.C. (e-mail: changhsc@cs.ccu.edu.tw; wupt@cs.ccu.edu.tw; jiguo@cs.ccu.edu.tw). J.-W. Chen and J.-S. Wang are with the Department of Electronics Engineering, National Chung-Cheng University, Chia-Yi, 621 Taiwan, R.O.C. (e-mail: 92jiawei@vlsi.ee.ccu.edu.tw; ieegsw@ccu.edu.tw). C.-L. Su is with the Department of Electronics Engineering, National Yunlin University of Science Technology, Yun-lin, Taiwan, R.O.C. (e-mail: kevinsu@yuntech.edu.tw). Copyright (c) 2009 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

In addition, improving the hardware efficiency of video coding LSIs such as MPEG-4/H.264 is a recent design trend in implementing multimedia systems, aimed at high-throughput design for high definition (HD) video [3, 8, 9] and low-power design for portable video [4, 5]. These designs each target one specific application: either HD video with high video resolution or portable video with smaller video resolution. Another dual-mode video codec design [10] supporting H.264/MPEG-4 not only satisfied HD720 (1280x720) video encoding, but also achieved low power consumption. However, it was still designed for middle or high video resolution with good picture quality. All of the dedicated hardware architectures for H.264 video encoders [3, 4, 7, 8, 9, 10] lack the flexibility of programmable multimedia processors, which can adjust the video quality by selecting different video coding algorithms during execution.
This flexibility benefits power-aware video applications that need to trade off video coding quality against power consumption. Although the design in [5] proposed an adaptive power-aware fast motion estimation algorithm that trades off video coding quality against power consumption by selecting different coding parameters, it was designed to support only low and middle video resolutions, up to H.264 SDTV (720x480) video encoding, in order to reduce power consumption. Moreover, the state-of-the-art multimedia processors are also limited in performance to H.264 D1 video encoding [6]. Therefore, achieving both real-time encoding of high-resolution videos and dynamic quality adjustability for power-aware video applications requires overcoming the challenge of making dedicated H.264 encoder hardware architectures configurable. To achieve high throughput rates, low power consumption, and dynamic quality adjustability, we propose a dynamic quality-adjustable H.264 BP video encoder that supports both versatile video resolutions from QCIF to HD720 and versatile video qualities on the same video (e.g. D1 video) when operating at different clock frequencies. The proposed design exploits parameterized motion estimation and intra coding algorithms that can be dynamically configured to operate at different quality modes with different computational complexity. This enables the proposed design to be operated at different clock rates, and thus at different power consumption, for a given video resolution. In addition, we also propose some design techniques to reduce the computational complexity of the key processing modules in H.264 video encoding, including two-stage fast integer motion estimation

algorithm, fractional motion estimation algorithm with block-size trend prediction, fast luminance intra 4x4 search algorithms (i.e. Context Correlation Search Algorithm and Probability Context Correlation Search Algorithm), fast chrominance intra search algorithm (i.e. Quarter Macro Block Search Algorithm), and high-throughput scanning scheme for entropy coding.

Furthermore, compressed multimedia content is nowadays often transmitted through heterogeneous networks, in which maintaining a constant and stable bit-rate is of great importance for achieving good video quality. The rate control algorithm in the H.264 reference software JM [12] consists of three levels: Group of Picture (GOP) level, frame level, and BU level. Among them, the BU level rate control algorithm allocates bits more evenly than the frame level rate control for video streaming. Among the existing MPEG-4/H.264 video encoder designs [3, 4, 5, 7, 8, 9, 10], none realizes a BU-based rate control algorithm in hardware. Instead, they use the frame level rate control algorithm. This is due to the strong data dependency of the BU-based rate control algorithm, which makes it difficult to realize in a pipelined H.264 video encoder design without increasing the latency induced by the sequential rate control processing. To solve this problem, we propose a BU-based rate control algorithm for the proposed H.264 video encoder that eliminates the data dependency of the original rate control in H.264. The proposed rate control algorithm achieves better video quality than JM frame-level rate control and facilitates hardware realization in H.264 video encoders.
Compared to the state-of-the-art H.264 BP video encoder designs, the proposed 470Kgates/13.3Kbytes SRAM H.264 video encoder not only achieves lower gate count and smaller internal memory, but also supports the unique feature of quality adjustability to trade off video coding quality against power consumption. Moreover, it achieves adjustable video encoding of 7mW-to-25mW for CIF@30fps, 27mW-to-163mW for D1@30fps, and 122mW-to-183mW for HD720@30fps when operated at four different quality modes, i.e. QS0, QS1, QS2, and QS3. The QS0 mode has the best quality performance with the highest operating frequency among these four quality modes. On the other hand, the QS3 mode demonstrates the worst quality performance but the least power consumption. At its maximum performance, the proposed design encodes HD1080 video@20fps when operated at 108MHz in QS3 mode.

The rest of this paper is organized as follows. We present the proposed H.264 video encoder in Section II. In Section III, we discuss the implementation and verification of the proposed design. In Section IV, we evaluate the performance of the proposed design as compared to the existing ones. Finally, we conclude this paper in Section V.

Fig. 1. Block diagram of the proposed H.264 BP video encoder.

II. PROPOSED H.264 VIDEO ENCODER

Fig. 1 shows the architecture of the proposed dynamic quality-adjustable H.264 video encoder design. Five key functional modules are optimized in the proposed design, including Integer Motion Estimation (IME), Fractional Motion Estimation (FME), Intra Coding, In-Loop Filter (ILF), and Entropy Coding (EC). To simplify the encoding flow and efficiently eliminate the data dependency, we adopt a MacroBlock (MB) level pipelining schedule. There are four pipeline stages in the proposed design, in the order of IME, FME, Intra Coding, and EC/ILF. The EC and ILF modules are located at the same pipeline stage to speed up the performance.
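The four-stage MB-level pipeline described above can be illustrated with a small scheduling sketch. This is only an idealized model, not the authors' controller: it assumes every stage takes exactly one MB time slot, and the function name `pipeline_schedule` is hypothetical.

```python
# Idealized model of the MB-level pipeline (assumption: one time slot per stage).
STAGES = ["IME", "FME", "Intra", "EC/ILF"]  # stage order given in the text


def pipeline_schedule(num_mbs):
    """Return one dict per time slot, mapping stage name -> MB index (or None)."""
    slots = []
    for t in range(num_mbs + len(STAGES) - 1):
        slot = {}
        for depth, name in enumerate(STAGES):
            mb = t - depth  # an MB enters IME at slot t and advances one stage per slot
            slot[name] = mb if 0 <= mb < num_mbs else None
        slots.append(slot)
    return slots


if __name__ == "__main__":
    for t, slot in enumerate(pipeline_schedule(6)):
        print(t, slot)
```

In this model, while MB4 occupies the IME stage, MB3, MB2, and MB1 occupy the FME, Intra Coding, and EC/ILF stages, so only MB0 has been fully encoded at that point.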
To support H.264 video encoding with adjustable qualities, configurations in both the ME and Intra Coding algorithms are provided through the System Controller. The System Controller is implemented as a complex FSM. To improve video quality, a pipelined BU-based rate control scheme is used to efficiently allocate the data bits. To reduce internal memory size, a Predictive Data Store Buffer (PDSB) controller is adopted to efficiently access the intermediate data for video encoding through AHB-based SDR memory. In the following, the major design techniques adopted in the proposed design are described to show how its high-throughput and dynamic quality-adjustable features are achieved.

A. Integer Motion Estimation

To support dynamic quality-adjustable coding in the proposed design, we first face the problem of developing flexible algorithms that can be implemented in hardware. Among the existing IME designs, some [13-19] are based on the full-search block matching algorithm. Although the full search method incurs no quality loss, it is difficult to achieve quality adjustability with a full-search block matching IME architecture. The other IME designs [20-24] use fast-search block matching algorithms, such as Three Step Search (TSS), Pixel Sub-Sampling, Data-adaptive, and MVFAST & Diamond search algorithms.

Fig. 2. Proposed HDLFS-IME algorithm.

Fig. 3. Illustration of the proposed HDLFS-IME algorithm with three configurable parameters.

Fig. 4. The complexity ratio of the proposed IME algorithm in different configurations over the FSBM algorithm. (Horizontal axis: configurations from (2,1,1) to (5,4,4).)

These IME designs mainly focus on reducing the computational complexity at the cost of sacrificing video quality in PSNR. However, they lack the flexibility to dynamically adjust the number of search points in the underlying ME algorithms. The design in [5] proposed an H.264 video encoder architecture along with a Four-Step Search (FSS) IME algorithm that trades off video coding quality against power consumption by selecting different iteration numbers of initial points in the variable block size FSS algorithm. However, it may suffer a larger quality drop because it searches only 25 points for IME in each reference frame. Therefore, in order to effectively reduce the hardware cost and dynamically adjust the video quality, we propose a low-complexity, high-flexibility integer motion estimation algorithm called the Half-word Down-sample Local Full Search algorithm (denoted as HDLFS-IME algorithm), which achieves both accurate motion estimation and flexible configuration for selecting different quality modes dynamically. The data flow of the proposed HDLFS-IME algorithm is shown in Fig. 2. The proposed HDLFS-IME operations with three parameters are illustrated in Fig. 3. The proposed HDLFS-IME algorithm consists of two stages.
Stage 1 of the proposed HDLFS-IME algorithm performs block matching operations on 2-D down-sampled search points in the search window, where the sampling rate is determined by the Down Sample Rate, and selects a variable number of good candidates (determined by the Candidate Number) for Stage 2. At the beginning of Stage 1, the current and reference pixels are truncated to half-word data (i.e. 4-bit data instead of 8-bit data) to reduce both the memory bandwidth and the hardware cost. Down sampling of the search points is performed along both the horizontal and vertical directions starting from the center of the search range. In this stage, only the 16x16 MB partition size is processed. The first stage thus sieves out several candidates that serve as the centers of the local full search operations in the second stage. Stage 2 of the proposed HDLFS-IME algorithm performs local full-search block-matching operations, over a range determined by the Local Full Search Range, on the search points around the candidates selected in Stage 1, in order to determine the best motion vectors for all 41 block sizes in IME. In the following, we analyze the complexity of both the proposed HDLFS-IME algorithm and the Full Search Block Matching IME algorithm (denoted as FSBM-IME algorithm). The total number of search points for one MB using the FSBM-IME algorithm (denoted as SP_FSBM) is shown in equation (1), where the search range is [-n, n]. Using the proposed HDLFS-IME algorithm, the number of search points for one MB in the first stage (denoted as SP_HDLFS_st1) is shown in equation (2), and the number of search points in the second stage (denoted as SP_HDLFS_st2) is shown in equation (3).
$$SP_{FSBM} = (2n+1)^2 \qquad (1)$$

$$SP_{HDLFS\_st1} = (n/z \times 2 + 1)^2 \qquad (2)$$

$$SP_{HDLFS\_st2} = y\,(2m+1)^2 \qquad (3)$$

$$\frac{SP_{HDLFS}}{SP_{FSBM}} = \frac{(n/z \times 2 + 1)^2 + y\,(2m+1)^2}{(2n+1)^2} \qquad (4)$$

The indexes z, y, and m denote the down-sampling ratio, the number of candidates, and the local search range [-m, m], respectively. The complexity ratio of the proposed HDLFS-IME algorithm over the FSBM-IME algorithm is shown in equation (4). Fig. 4 shows the complexity ratio plot of the proposed IME algorithm in different (z, y, m) configurations over the FSBM-IME algorithm, where n is equal to 16. For example, the complexity ratio is 9.6% when the indexes (n, z, y, m) equal (16, 5, 2, 2). The proposed HDLFS-IME algorithm thus reduces the computational complexity by about 90.4% compared to the FSBM-IME algorithm. The major configurable parameters of the proposed HDLFS-IME algorithm are listed in Table I.
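To make the two-stage search and its search-point count concrete, the following Python sketch models both. It is a simplified software model under stated assumptions, not the authors' hardware: it evaluates only the 16x16 SAD (the real design derives motion vectors for all 41 block sizes), the function names `sad16`, `hdlfs_ime`, and `complexity_ratio` are hypothetical, the Stage 1 grid is placed at integer multiples of z around the center, and the ratio treats n/z as a real-valued quotient, which is the reading of equations (1) to (4) that reproduces the 9.6% figure quoted in the text.

```python
import random


def sad16(cur, ref, bx, by, dx, dy, shift=0):
    """SAD of a 16x16 block; shift=4 keeps only the upper 4 bits of each pixel."""
    total = 0
    for r in range(16):
        for c in range(16):
            total += abs((cur[r][c] >> shift) - (ref[by + dy + r][bx + dx + c] >> shift))
    return total


def hdlfs_ime(cur, ref, bx, by, n, z, y, m):
    """Return (sad, dx, dy) of the best 16x16 match within the search range [-n, n]."""
    # Stage 1: 4-bit truncated SAD on a grid down-sampled by z, centered on (0, 0).
    grid = range(-(n // z) * z, n + 1, z)
    stage1 = sorted((sad16(cur, ref, bx, by, dx, dy, shift=4), dx, dy)
                    for dy in grid for dx in grid)
    candidates = stage1[:y]  # keep the y best candidates
    # Stage 2: full-precision local full search of range [-m, m] around each candidate.
    best = None
    for _, cx, cy in candidates:
        for dy in range(max(-n, cy - m), min(n, cy + m) + 1):
            for dx in range(max(-n, cx - m), min(n, cx + m) + 1):
                cost = sad16(cur, ref, bx, by, dx, dy)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best


def complexity_ratio(n, z, y, m):
    """Search-point ratio of HDLFS-IME over FSBM-IME, following equation (4)."""
    stage1 = (n / z * 2 + 1) ** 2
    stage2 = y * (2 * m + 1) ** 2
    return (stage1 + stage2) / (2 * n + 1) ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    ref = [[rng.randrange(256) for _ in range(64)] for _ in range(64)]
    # The current block is a copy of the reference displaced by (dx, dy) = (4, -2).
    cur = [row[24 + 4:24 + 4 + 16] for row in ref[24 - 2:24 - 2 + 16]]
    print(hdlfs_ime(cur, ref, 24, 24, n=8, z=2, y=2, m=1))
    print(round(100 * complexity_ratio(16, 5, 2, 2), 1))  # percent
```

Because the demo displacement lies on the Stage 1 grid, the truncated SAD is zero there and the exact vector survives into Stage 2. For displacements off the grid, the result depends on how smooth the image content is and on the chosen (z, y, m), which is the quality trade-off the parameters control.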

Fig. 5. The comparison of PSNR performance using the proposed algorithm in different configurations with JM full search algorithm at QCIF 256/512 kbit/s sequences.

Fig. 6. The comparison of PSNR performance using the proposed algorithm in different configurations with JM full search algorithm at CIF 768/1024 kbit/s sequences.

Fig. 7. The comparison of PSNR performance using the proposed algorithm in different configurations with JM full search algorithm at SDTV 1536/2048 kbit/s sequences.

Fig. 8. The comparison of PSNR performance using the proposed algorithm in different configurations with JM full search algorithm at HD720p 3072/7168 kbit/s sequences.

Fig. 9. The block diagram of the proposed IME architecture based on the HDLFS-IME algorithm.

TABLE I. THE CONFIGURABLE IME PARAMETERS IN THE PROPOSED DESIGN

With these parameters, we can use different configurations to meet the requirements of various applications. For example, portable devices usually need low power and acceptable quality. For them, we can use a larger down-sample ratio, fewer candidates, and a smaller local full search range to reduce the power consumption. For high definition television applications, fine image quality is the major design consideration. In that case, we can use a smaller down-sample ratio, more candidates, and a larger local full search range to generate accurate motion vectors in video encoding. Fig. 5 to Fig. 8 show the PSNR performance under different HDLFS-IME parameters and different bit rates compared to FSBM-IME in JM9.3 with the following settings: Intra Period: 1I29P, Search Range: 32x32, Reference frame number: 1, Rate Distortion Optimization (RDO): Off, Rate Control: Frame-based rate control. The test sequences consist of QCIF (176x144), CIF (352x288), SDTV (720x480), and HD720 (1280x720) video formats. From the figures, we draw the following conclusions. First, the larger the Down Sample Ratio, the larger the PSNR drop.
Second, the larger the Candidate Number, the smaller the PSNR drop. Third, the larger the Local Full Search Range, the smaller the PSNR drop. Fig. 9 shows the proposed IME architecture based on the HDLFS-IME algorithm. Benefiting from the low-complexity

feature of the proposed HDLFS-IME algorithm, the proposed IME architecture has low hardware cost with good video quality. It also provides dynamic quality adjustability through the intelligent address generator and stage controllers, which supply the configurable parameters to the hardware architecture. In addition, to reduce hardware cost, the same hardware is shared between the IME operations of both stages of the HDLFS-IME algorithm. The IME operations of the first and second stages in the HDLFS-IME algorithm are similar to each other, except for the adoption of bit-truncation and pixel down sampling in the first stage. In the first stage, two processing elements (PEs) in the SAD Calculator process two 4x4 blocks per cycle, so eight cycles are required for calculating the SAD of one MB, as shown in Fig. 10 (a). The dedicated PE architecture contains sixteen absolute difference (AD) modules. The input pixels of the current data and reference data are each separated into two 4-bit halves to realize pixel bit-truncation. Thus, each AD module in the PEs can process two truncated pixels at once, such as MSB1 and MSB2 of the reference data in Fig. 10 (b), which doubles the throughput rate. In the second stage, the reference pixels and current pixels are not truncated, to ensure the accuracy of the SAD. After generating sixteen SADs for the 4x4 blocks, the Mode Generator generates the SADs for all the other modes, such as 16x16, 16x8, 8x16, 8x8, 8x4, and 4x8. Using TSMC 0.13µm CMOS technology with the frequency constraint of 150MHz, the proposed design costs 59.8K gates, and requires a 4Kbits Current pixel buffer and a 22Kbits Reference pixel buffer for supporting search ranges of [-16, 16].

Fig. 10. (a) The architecture of the proposed processing element (PE), with 1st stage and 2nd stage operation; (b) The architecture of the proposed absolute difference (AD) module.

B.
Fractional Motion Estimation

In the H.264 reference software JM9.3, the FME algorithm adopts 2-step block matching operations to examine every candidate block in the search range, as shown in Fig. 11. The center integer pixel is the best-matched point selected by IME for each MB partition. First, the 8 half pixels around the center pixel are processed to select the best-matched half-precision pixel. Then, the 8 quarter pixels around the best-matched half-precision pixel are searched to get the best-matched quarter-precision pixel of each partition. After the best-matched quarter-precision pixels of each MB partition are selected, every MB combination type is examined to find the best-matched combination.

Fig. 11. Sketch map of 2-step block matching search of FME in JM9.3 (integer pixels, half pixels, quarter pixels).

Most existing FME designs [26-28] adopt this 2-step block matching algorithm for hardware realization, either adopting array processing for high throughput rate or exploiting data reuse in the partial summation of absolute difference (SAD) for reducing the hardware cost. Another FME design [29] is based on an algorithm called A Single Iteration Fractional Motion Estimation Algorithm (SIFME), which searches six candidates in total, including two square points and four triangle points. Whichever FME algorithm these designs use, the major idea is to predict the direction of the fractional motion vector in the first stage, and then to search several points around the best candidate of the first stage. They use [-0.75, +0.75] as the FME search range along both the X- and Y-directions. As previously described, we propose the HDLFS-IME algorithm to calculate the integer motion vectors for an MB. In Stage 2 of the HDLFS-IME algorithm, we search around the integer candidates with a local full search block matching operation.
If we assume the best integer motion vector has been found in the IME stage, the FME stage just needs to take [-0.5, +0.5] as the search range around this best integer candidate. It is not necessary to search an area larger than [-0.5, +0.5]. Based on this analysis, in order to reduce the complexity of FME, we propose a Cluster Selection Fractional Motion Estimation algorithm (denoted as CS-FME algorithm) with the [-0.5, +0.5] search range. The proposed CS-FME algorithm adopts the integer motion vector (IMV) obtained from the proposed HDLFS-IME algorithm as the search center, and takes [-0.5, +0.5] as the search range along both X- and Y-directions. At quarter-pixel precision, this range contains 5x5 = 25 candidate points, on which we perform full-search block matching for each block mode. Fig. 12 shows Cluster Selection and Block Size Trend Prediction (BSTP) in the proposed CS-FME algorithm. The FME operations on the 41 modes of IME motion vectors are separated into two clusters, i.e. Cluster 1 (FME operations on the 16x16, 16x8, 8x16, and 8x8 block sizes) and Cluster 2 (FME operations on the 8x4, 4x8, and 4x4 block sizes). The adjustable qualities are provided by selecting different numbers of clusters for doing FME. In the proposed

CS-FME algorithm, a BSTP scheme is adopted to skip the unnecessary FME operations on the IME modes of Cluster 2 if the IME cost in 16x16 mode is lower than that in 8x8 mode. Fig. 13 shows the 25 candidate points in the proposed CS-FME algorithm. Compared to JM9.3, the PSNR performance of the proposed CS-FME algorithm is shown in Fig. 14 for Cluster 1 + Cluster 2 and in Fig. 15 for Cluster 1 only. As mentioned previously, the FME search algorithm in JM9.3 is a 2-stage search algorithm, and the Lagrangian mode decision is adopted to determine the best MB partition. The test sequences cover the QCIF, CIF, SDTV, and HD720 video resolutions. The other settings are as follows: 1) search range: 32x32, 2) Intra period: 30 (1I29P), 3) RDO: Off, 4) Reference frame number = 1, 5) RC algorithm: Frame-based, 6) Bit rates: 128 to 512 kbit/s for QCIF, 512 to 896 kbit/s for CIF, 1024 to 1792 kbit/s for SDTV, and 2048 to 2816 kbit/s for HD720. Fig. 14 and Fig. 15 show that the proposed CS-FME algorithm achieves better PSNR performance when the video resolution is larger.

Fig. 12. Cluster Selection and BSTP algorithm in the proposed CS-FME algorithm.

Fig. 13. Processing candidate points in the proposed CS-FME algorithm.

Fig. 14. Comparison of PSNR performance with JM when using Cluster 1 + Cluster 2 in the proposed CS-FME algorithm.

The architecture of the proposed quality adjustable CS-FME design is shown in Fig. 16. The MV/Mode SRAM stores the 41 integer motion vectors (IMVs) from IME. The Quality Adjustable Controller controls the order of the FME for different block-size partitions and performs the mode selection determined by the cluster parameter. The MV Cost Calculator can be divided into three parts. The first part calculates the cost of the reference frame number. This cost is always zero because we only support one reference frame. The second part calculates the cost of the motion vectors.
The IMVs from the MV/Mode SRAM are transmitted to the MV Cost Calculator to calculate the cost of the motion vectors corresponding to the 25 fractional pixels in the search range. The third part calculates the SATD to get the cost of the difference pixels. In this part, we first calculate the half and quarter pixels with the Interpolation Unit. Then, we use these interpolated data to calculate the SATD.

Fig. 15. Comparison of PSNR performance with JM when only using Cluster 1 in the proposed CS-FME algorithm.

Fig. 16. Architecture of the proposed CS-FME design.

In order to generate the interpolated data efficiently and increase the throughput, the Interpolation Unit can perform operations in dual directions and the SATD architecture can process 8x4 blocks instead of 4x4 blocks. When the Interpolation Unit performs the operations on 8x4 blocks, the horizontal interpolation and the vertical interpolation filters can process 14 integer pixels and 4 integer pixels in one cycle, respectively. This is why the FME Search Window SRAM (SWS) has to be partitioned into 6 banks with 4 pixels in a bank. Fig. 17 shows an example of the filtering operations between the Interpolation Unit and the Search Window SRAM (SWS). After the fractional interpolation operations, we use these interpolated data to perform the SATD calculation. Fig. 18 shows the architecture of the SATD Calculator. It

contains 20 Processing Elements (PEs) classified into two parts, i.e. 10 PEs for processing the left 4x4 block, and 10 PEs for processing the right 4x4 one. Using this architecture, we can accelerate the processing of the 16x16, 16x8, 8x16, 8x8 and 8x4 modes. For the 4x8 and 4x4 modes, the 10 PEs located at the right side of the SATD Calculator are disabled to reduce the power consumption. Finally, the Best Mode Selector decides the best combination for the FME. Using TSMC 0.13µm CMOS technology with the frequency constraint of 150MHz, the proposed design costs 180.2K gates, and requires 1555 cycles/MB and 557 cycles/MB to realize the FME operations in (Cluster 1 + Cluster 2) and Cluster 1, respectively.

Fig. 17. Example of filtering operations between Interpolation Unit and FME Search Window SRAM (SWS). (Labels: horizontal pixels; SWS Bank(i) to SWS Bank(i+4).)

Fig. 18. Architecture of SATD Calculator for 8x4 blocks.

Fig. 19. Proposed low complexity search algorithms for quality adjustable intra coding.

C. Intra Coding

In the literature, there are some H.264 intra encoder designs [30-32] that focus on optimizing the mode decision scheduling or eliminating the I16MB/Chroma plane prediction mode to reduce the processing cycles for low-complexity intra coding. However, these designs only focus on reducing the intra coding complexity with acceptable video qualities. They do not discuss the possibility of supporting flexible intra coding methods for power-aware video applications with a trade-off between video quality and power consumption. In the proposed design, we not only propose a flexible fast intra coding algorithm that dynamically adjusts the video quality through configurable parameters to exhibit different power consumption for different applications, but also exploit the common terms among the intra predictions of different modes (including I16MB/Chroma plane mode) to reduce the hardware cost.
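Both the FME cost calculation described above and the intra mode decision rely on the SATD of the prediction residual. As a reference point, a minimal software sketch of a 4x4 Hadamard-based SATD is given below, together with an 8x4 variant that mirrors the left/right split of the SATD Calculator. The function names are hypothetical, and the sketch returns the plain sum of absolute transform coefficients; any normalization factor applied in JM or in the authors' hardware is not modeled.

```python
# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd4x4(cur, pred):
    """Sum of absolute values of the 2-D Hadamard transform of (cur - pred)."""
    diff = [[cur[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    coeffs = matmul4(matmul4(H4, diff), H4)  # H * D * H^T, with H^T == H
    return sum(abs(v) for row in coeffs for v in row)


def satd8x4(cur, pred):
    """SATD of a block 8 pixels wide and 4 high, as left and right 4x4 halves."""
    left = satd4x4([r[:4] for r in cur], [r[:4] for r in pred])
    right = satd4x4([r[4:] for r in cur], [r[4:] for r in pred])
    return left + right
```

Splitting the 8x4 block into two independent 4x4 transforms is what allows one half of the datapath to be switched off for the 4x8 and 4x4 modes.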
In order to realize the quality adjustability in hardware with little quality loss, we propose two search techniques for luminance intra 4x4 mode decisions, i.e. Context Correlation Search Algorithm (CC-SA) and Probability Context Correlation Search Algorithm (PCC-SA) [33]. The CC-SA technique takes advantage of the spatial correlation of intra texture between the current block and neighboring blocks. The CC-SA technique searches 4.8 modes per block on average. In addition, PCC-SA exploits the statistics of intra coding modes in real sequences and searches only the high-probability modes to further reduce complexity. The PCC-SA technique searches fewer prediction modes than CC-SA (i.e. 3.7 modes per block on average). Fig. 19 shows the proposed search algorithm with the detailed search modes. If the upper block and left block are both unavailable, we only do Mode 2 prediction. If only the upper block is unavailable, Modes 1, 2, and 8 are selected as candidates. If only the left block is unavailable, Modes 0, 2, 3, and 7 are selected as candidates. If the upper block and left block are both available, we search the modes listed in Fig. 19 according to the selected search algorithm and the modes of the upper and left blocks. Compared to the full search algorithm of intra coding in the H.264 reference software JM, adopting CC-SA and PCC-SA reduces the computational complexity by 45% and 57%, respectively. For intra prediction on chrominance pixels, a Quarter MB Search Algorithm (QMB-SA) is proposed based on the observation that human eyes are less sensitive to errors in chrominance pixels than in luminance ones. Hence, we only perform the intra prediction on the left-top block instead of all four chrominance blocks in an MB. All of the proposed Intra Coding search algorithms above may cause a quality drop.

Fig. 20. Comparison of PSNR performance with JM when using CC-SA, PCC-SA, and QMB-SA.
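The neighbor-availability rules for selecting luminance intra 4x4 candidate modes can be sketched as follows. The fixed candidate lists come from the text; the `lookup` table for the case where both neighbors are available stands in for the CC-SA or PCC-SA table of Fig. 19, whose contents are not reproduced here, so it is passed in as a hypothetical parameter. The function name is likewise an assumption.

```python
# H.264 intra 4x4 prediction mode numbers used in the text:
# 0 = vertical, 1 = horizontal, 2 = DC, 3 = diagonal down-left,
# 7 = vertical-left, 8 = horizontal-up.


def intra4x4_candidates(upper_mode, left_mode, lookup):
    """Return the intra 4x4 modes to search for the current block.

    upper_mode / left_mode: best mode of the neighboring block, or None if
    that neighbor is unavailable.
    lookup: hypothetical dict mapping (upper_mode, left_mode) to a list of
    candidate modes, standing in for the table of Fig. 19.
    """
    if upper_mode is None and left_mode is None:
        return [2]            # only DC prediction is possible
    if upper_mode is None:
        return [1, 2, 8]      # modes that need only the left pixels
    if left_mode is None:
        return [0, 2, 3, 7]   # modes that need only the upper pixels
    return lookup[(upper_mode, left_mode)]
```

For example, `intra4x4_candidates(None, 1, {})` returns `[1, 2, 8]` regardless of the table, because the upper neighbor is missing.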
Therefore, Fig. 20 compares the PSNR results of the different proposed Intra Coding algorithms at different QP values. The simulation conditions are based on three different CIF videos, namely foreman, mobile, and stefan (300 I-frames), with Hadamard-based SATD mode decision. However, adopting

(CC-SA + QMB-SA) and (PCC-SA + QMB-SA) increases the bit-rate by 3.71% and 6.63% on average, respectively, as compared to the full search in JM. The proposed quality adjustable intra coding architecture consists of Mode Decision Core (MDC) and Texture Coding Core (TTC), as shown in Fig. 21. In the MDC, the Intra Pixel Generator (IPG) generates the intra predictors. Then, the SATD and Mode Decision unit computes the residual data to decide the best prediction mode. To support quality adjustability, we realize the CC-SA and PCC-SA tables and the QMB-SA algorithm in the mode decision controller. To reduce mode decision time, we use a 2-D Hadamard transform in the SATD calculation. It saves 32 cycles compared to a design using a 1-D Hadamard transform. To reduce the hardware cost, we integrate the DCT, IDCT and Hadamard transform into a multi-transform unit and optimize the IPG with hardware sharing mechanisms, including the Shared Item Mechanism of Intra Pixel Generator (SIMIPG) and the Plane Mode Sharing Mechanism (PMSM) [34]. In addition, although the plane mode in H.264 intra coding is complex, we realize it by sharing the IPG hardware in order to improve the video quality. Fig. 22 shows the proposed Intra Pixel Generator (IPG) architecture.

Fig. 21. Architecture of the proposed quality adjustable intra coding design.

Fig. 22. Proposed Intra Pixel Generator (IPG) architecture.

Fig. 23. Example of the data dependence in a pipelined H.264 encoder design.

Fig. 24. (a) Proposed MAD prediction pattern; (b) Proposed real bit prediction. (Labels: i = 1, 5, 9; j = 1, 2, 3.)

D. Hardware Rate Control

The BU-based rate control in the H.264 reference software JM has a strong data dependency in encoding each MB, which causes the needed real bit sizes and Mean Absolute Difference (MAD) values to be unavailable when generating the quantization parameters in MB-pipelined H.264 encoders. In the literature, many RC algorithms [35-38] have been proposed to improve the quality of H.264 JM rate control.
However, all of these RC algorithms are implemented in software, and they need a large amount of prediction data or a complex RC model to achieve accurate estimation. This makes them difficult to realize in a pipelined H.264 video encoder without increasing the latency induced by the sequential RC process. For example, as illustrated in Fig. 23, when MB4 is coded in the IME stage, it needs the real bit sizes and MAD values of the previous MBs (i.e., MB0 to MB3) to generate its QP. However, MB1, MB2, and MB3 are still being coded in the Entropy, Intra, and FME stages, respectively, so their real bit sizes and MAD values are not yet available when the rate control algorithm is realized in a pipelined H.264 encoder.

To solve this problem, we propose a real bit size prediction and a MAD prediction. The MAD prediction for the current BU is obtained from the previous frame according to the BU prediction pattern shown in Fig. 24(a). The real bit size prediction is obtained from the proposed Lagrange model, according to the formula listed in Fig. 24(b), while the current BU is still being encoded. The minimum cost (d) is the Inter or Intra Cost of each previous MB, as shown in Fig. 25. For example, when MB4(n+1) is coded in the IME stage, neither the FME Cost nor the Intra Cost of MB(4n+3) is available yet. Therefore, we propose Eq. (5) and Eq. (6) to predict the FME Cost and Intra Cost of the previous MBs. The resulting minimum cost is then substituted into the real bit size prediction formula to obtain the predicted bit size. The value of Threshold in Eq. (5) and Eq. (6) is a user-defined parameter that distinguishes slow motion from fast motion. Based on our simulations, the ideal values of Threshold are 8, 8, and 16 for QCIF, CIF, and D1 video resolutions, respectively. These two predictions remove the data dependency between MB encodings in the proposed H.264 video encoder, so that a constant and stable bit-rate can be achieved. This leads to best-effort bit-rate utilization and thus good video quality for video streaming applications.

TABLE II. COMPARISON OF PSNR AND TARGET BIT-RATE WITH JM FOR THE PROPOSED BU-BASED RC

Fig. 25. Data Dependency of Minimum Inter Cost and Minimum Intra Cost.

Fig. 26. Block diagram of the proposed RC architecture.

Table II shows the simulation results. The other settings are as follows: 1) RC algorithm: BU-based, 2) Basic Unit: 1 MB, 3) Intra period: 30 (1I29P), 4) RDO: off, 5) Reference frame number: 1, 6) Search range: 32x32. As shown in Table II, the proposed RC algorithm achieves almost the same video quality as that in JM9.3.

Fig. 25 also shows the RC pipelining in the proposed H.264 video encoder system. The first four MBs are assigned the initial QP because the four-MB pipeline scheme restricts the available prediction data. When the processed MB is in the Entropy stage, its MAD value is calculated by the proposed MAD prediction. Then, the Rate Distortion (RD) and MAD models are updated for the next QP generation, and the Lagrange model is updated for the real bit size prediction. Finally, the Bit Allocation and QP Generation modules generate suitable QP values at the beginning of the IME stage.

Fig. 26 shows the block diagram of the proposed RC architecture. The major block in the proposed architecture is the ALU, which consists of seven adders, two multipliers, one 16-cycle sequential divider, one 4-stage pipelined divider, one square-root unit, and one QP generator.
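The data dependency behind the two predictions can be illustrated with a small software model. The sketch below is not the proposed hardware; it assumes an idealized four-stage MB pipeline in the stage order used in the Fig. 23 example (IME, FME, Intra, Entropy), with exactly one MB per stage per time slot. All function names are hypothetical.

```python
# Idealized model of a four-stage MB pipeline (assumption: one MB enters
# the IME stage per time slot and advances one stage per slot).
STAGES = ("IME", "FME", "Intra", "Entropy")


def stage_occupancy(slot):
    """Return {stage name: MB index} for the MBs in flight during a slot."""
    occupancy = {}
    for depth, stage in enumerate(STAGES):
        mb = slot - depth
        if mb >= 0:
            occupancy[stage] = mb
    return occupancy


def finished_mbs(slot):
    """MBs that have already left the Entropy stage when the slot starts.

    Only these MBs can supply a real bit size to the rate controller.
    """
    return list(range(0, max(0, slot - len(STAGES) + 1)))


def needs_initial_qp(mb):
    """True if no earlier MB has finished when this MB enters IME."""
    return not finished_mbs(mb)


if __name__ == "__main__":
    for slot in range(6):
        print(slot, stage_occupancy(slot), "finished:", finished_mbs(slot))
```

In this model, when MB4 enters IME, MB3, MB2, and MB1 occupy the FME, Intra, and Entropy stages and only MB0 has finished, which matches the situation described for Fig. 23. The first four MBs have no finished predecessor at all, which is consistent with assigning them the initial QP.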
We adopt a CPU-like architecture for the proposed RC design so that the arithmetic operators can be shared. Both QP generation and the RC&MAD model update are realized by ALU operations, which reduces the hardware cost at the expense of more processing cycles. Another major design consideration is fitting the cycle count budget for encoding each MB in the pipelined stages of the H.264 video encoder. QP generation must be performed before IME for each MB. Therefore, the design has 120 cycles to generate the QP and 300 cycles to update the RC&MAD model after Entropy Coding finishes for each MB. It takes only about 100 and 260 cycles to finish QP generation and the RC&MAD model update, respectively, which fits within these budgets.

E. ILF & Entropy Coding

The ILF is a 4x4-block based architecture that filters the data in an MB in a horizontal-vertical interleaved raster scan order. With this filtering order, the filtered 4x4 blocks are output in a regular order, which simplifies both address generation and writing the data to the frame buffer in external memory using burst mode accesses. The filter design in the ILF is similar to that used in the H.264 video decoder [11].

Fig. 27 shows the high-throughput entropy coding (i.e., CAVLC) architecture, which consists of an Exp-Golomb Coding unit and a Residual Engine. The Residual Engine is composed of a Scanning Engine and a Coding Engine. The processing bottleneck of the Residual Engine lies in the residual data scanning, which requires 16 cycles for each 4x4 block, as indicated by the Traditional Scanning Scheme in Fig. 27. To remove this bottleneck, we propose a First-One Detecting Scanning Scheme (FDSS), implemented in the Scanning Engine, that quickly detects the value of run_before for each non-zero coefficient, as indicated by FDSS in Fig. 27. A simple example of the FDSS is given in Fig. 28. First, the 16 coefficients (14 bits each) of one 4x4 block are represented by 16 1-bit signals. If a coefficient is zero, its signal is set to 0; otherwise, the signal is set to 1. Then, FDSS scans only these 16 1-bit signals to check for non-zero values, which not only improves the scanning throughput but also avoids lengthening the critical path. Adopting the FDSS technique improves the throughput by about 6 times compared to the traditional scheme.

Fig. 27. Proposed high-throughput entropy coding architecture and FDSS scanning scheme.

Fig. 28. The example of FDSS scanning scheme.

TABLE III. CYCLE COUNT FOR THE PROPOSED HDLFS-IME UNDER DIFFERENT COMBINATIONS OF DSR, CN AND LFSR

F. Exploiting quality modes in H.264 video encoder

The major goal of the proposed design is to achieve quality adjustability through dynamic parameter configuration.
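As a software analogy for the FDSS scheme of Section E, the following sketch reduces a 4x4 block to 16 one-bit flags and then locates the non-zero coefficients by repeated first-one detection instead of visiting all 16 positions. The bit-flag representation follows the description in the text; the function names, the use of a Python integer to hold the flags, and the way run_before is derived from successive first-one positions are assumptions made for illustration and do not describe the actual circuit.

```python
def nonzero_mask(coeffs):
    """Pack one flag per coefficient into an integer (bit i set if coeffs[i] != 0)."""
    mask = 0
    for i, c in enumerate(coeffs):
        if c != 0:
            mask |= 1 << i
    return mask


def run_before_list(coeffs):
    """Return (scan position, run_before) pairs, highest scan position first.

    coeffs holds the 16 coefficients of a 4x4 block in scan order.
    run_before is the number of zeros directly preceding a non-zero
    coefficient in scan order.
    """
    mask = nonzero_mask(coeffs)
    positions = []
    while mask:
        pos = mask.bit_length() - 1  # first-one detection from the top bit
        positions.append(pos)
        mask &= ~(1 << pos)          # clear the detected flag and repeat
    runs = []
    for k, pos in enumerate(positions):
        lower = positions[k + 1] if k + 1 < len(positions) else -1
        runs.append((pos, pos - lower - 1))
    return runs


if __name__ == "__main__":
    block = [3, 0, 0, -1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    print(run_before_list(block))
```

The loop runs once per non-zero coefficient rather than once per position, which is the intuition behind scanning the 1-bit signals; in hardware the detection would presumably be done by combinational logic rather than a loop.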
With the proposed flexible quality-adjustable algorithms for the IME, FME, and Intra Coding modules (the HDLFS-IME algorithm, the CS-FME algorithm, and the CC-SA, PCC-SA, and QMB-SA algorithms in Intra Coding), we can explore how the configuration parameters map onto different quality modes. First, we analyze the processing cycles of each encoding module in the proposed design. The most complex case is the proposed HDLFS-IME algorithm, since it has three configuration parameters (the Down-Sample Ratio, denoted DSR; the Candidate Number, denoted CN; and the Local Full Search Range, denoted LFSR), which yield many combinations. Among these, we have to decide which combinations are selected as the parameters of the quality modes in the proposed design. Table III shows the cycle analysis per MB under the different HDLFS-IME configuration parameters. For example, the columns of Table III marked with the same texture have very similar processing cycle counts but different PSNR drops. From Table III and Figs. 5 to 8, we observe that a larger DSR for first estimating the motion vector trend, combined with a larger LFSR for determining the final motion vectors, gives better video quality than the other combinations. Therefore, for a given timing budget, we adopt the parameters with the smallest PSNR drop.

For the proposed CS-FME design, using only Cluster 1 takes 557 cycles to finish the FME operations on an MB, whereas using both Cluster 1 and Cluster 2 takes 1555 cycles. Intra coding of an MB takes 1112 cycles, 760 cycles, and 626 cycles with the full-search algorithm, the CC-SA/QMB-SA search algorithm, and the PCC-SA/QMB-SA search algorithm, respectively. Finally, Entropy Coding and the ILF both take about 300 cycles per MB on average.

Based on the above analysis, we define four quality modes (QS0, QS1, QS2, and QS3) for dynamically configuring the proposed video encoder on a frame-by-frame basis. Table IV shows the mapping of the quality modes to the parameters of the proposed algorithms. By setting DSR, CN, LFSR, and the number of Clusters, the design provides four quality modes for the IME and FME operations. By exploiting the proposed CC-SA, PCC-SA, and QMB-SA algorithms to evaluate different numbers of intra-coding modes, we provide the quality levels QS0/QS1, QS2, and QS3 in intra coding. Compared to JM [12], the proposed encoder design incurs an average PSNR loss of 0.15dB, 0.16dB, 0.4dB, and 0.6dB when operating at QS0, QS1, QS2, and QS3, respectively.

TABLE IV. THE DEFINED QUALITY MODES IN THE PROPOSED DESIGN

G. Prediction Data Store Buffer (PDSB)

In H.264 video coding, there are many data correlations between the MB currently being coded and its previously coded neighboring MBs. For example, entropy coding needs the Coded Block Pattern (CBP), the Motion Vector Difference (MVD), and the upper row of 4x4 blocks in the neighboring MBs. In addition, intra coding needs the reconstructed pixels of the upper MBs, and the ILF needs the not yet fully filtered pixels of the upper MBs. If all the correlated data were stored in internal memory, about 19K/13K bytes of internal memory would be required to support HD1080/HD720 video encoding. To reduce this local memory requirement, we adopt a DMA-like PDSB design [11] that collects the correlated data of an MB and stores them in external memory if they are not used immediately.

TABLE V. SCALABILITY IN TRADING OFF DIFFERENT LOCAL MEMORY SIZES AND MEMORY BANDWIDTH OVERHEAD
In the proposed H.264 encoder, to avoid the overhead of accessing prediction data through the AMBA AHB interface, we always keep the prediction data of entropy coding (the coded block pattern, the motion vector difference, and the number of non-zero coefficients) in local memory. The other prediction data (i.e., the Intra and ILF prediction data) are accessed from external memory through the PDSB scheme and the AMBA interface. The proposed PDSB scheme therefore trades the size of the required internal memory against an increase in external memory bandwidth, as shown in Table V. This trade-off gives system designers the flexibility to choose the configuration that suits their implementation, depending on the chosen fabrication technology.

Fig. 29. (a) FPGA Prototyping of the proposed design; (b) The environment for chip testing.

III. DESIGN IMPLEMENTATION

In developing the proposed H.264 video encoder, we adopt the Concurrent Versions System (CVS) tool for file version control to improve communication efficiency during the design process. The detailed implementation of the proposed design is described in the following sub-sections.

A. Design flow and methodology

The proposed design is coded in the Verilog Hardware Description Language (HDL), with SPRINGSOFT nlint checking, SYNOPSYS logic synthesis, NANOSIM pre-layout simulation, and NANOSIM post-layout simulation, targeting the TSMC 1P8M 0.13µm CMOS technology. In RTL coding and verification, a code coverage tool, i.e., VN-Navigator, is used to analyze the code coverage of the provided test-benches. To speed up the integration of the sub-modules of the proposed design, we have built a verification platform that contains the proposed design and the bus functional models (BFMs) of the RISC processor and the external SDR memory for co-simulation.
To quickly fix the bugs encountered in system integration, the verification platform automatically compares the intermediate results of each module in the proposed H.264 video encoder with the associated test patterns dumped from the H.264 reference software JM. This verification platform can verify the proposed design in the form of both RTL and gate-level models. In addition, an assertion-based verification technique is adopted to facilitate debugging. The Open Verification Library (OVL) [39] is used as the assertion tool, so that errors can be quickly detected and then fixed.

B. Prototype of the proposed design

For system-level verification, we perform board-level testing of the proposed design on the ARM-based FPGA platform (FIE8100) from Faraday Inc. To verify the functional correctness of the proposed design, we have run over 100 test sequences with different quality modes and parameters on the FPGA prototype, as shown in Fig. 29(a). The system clock (for AMBA and the proposed H.264 encoder) and the CPU clock of the ARM-based platform are limited to 40MHz and 200MHz, respectively.


First, we load the test sequences (YUV files) from the SD card into SDRAM. Then, the CPU activates the proposed encoder, which encodes frame by frame, and the CPU receives and clears an interrupt from the encoder after each frame is encoded. The reconstructed reference data from the encoder are then written to the display memory for a 320x240 LCD. Finally, the proposed encoder writes the bit-stream to the SD card through the AMBA interface.

C. Chip testing

To measure the performance of the proposed H.264 video encoder chip, we use the chip testing environment shown in Fig. 29(b) to measure the power consumption and the maximum operating clock frequency of the proposed design. In this environment, a pattern generator provides the input data through the AMBA interface, and the encoded bitstream is read back from the external memory interface to check the correctness of the chip. According to the measurements, the proposed design operates at 10MHz to support CIF encoding, consuming 7mW at 0.7V in QS3 mode. The power consumption is 183mW at 1.2V when the design operates at 108MHz to support HD720 in QS2 mode.

IV. PERFORMANCE ANALYSIS

Fig. 30 shows the performance of the proposed design in its four quality modes. It can encode D1 and HD720 videos at clock rates of 30/40/60/96MHz and 72/108MHz, respectively, at different quality levels with less than 0.6dB of PSNR loss on average. The adjustable quality allows the design to trade encoding quality against power consumption. This yields different video recording times for a given battery charge when the design is used in power-adaptive coding applications such as portable digital video recorders.

Fig. 31 summarizes the chip implementation. The core size is 4.3x4.3mm² and includes 470Kgates and 13.3Kbytes of internal memory. The power consumption is 7mW-to-25mW, 27mW-to-163mW, and 122mW-to-183mW for encoding CIF@30fps, D1@30fps, and HD720 video@30fps, respectively, in the different quality modes.
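A rough sense of what the clock rates in this section mean per macroblock can be obtained from simple arithmetic. The sketch below is not part of the proposed design: it computes the average number of clock cycles available per MB for a given clock rate and frame size. It assumes 16x16 MBs, 30 frames per second, a D1 frame size of 720x480 (the text does not state which D1 variant is used), and no pipeline fill or bus stall overhead. The function names are illustrative.

```python
def macroblocks_per_frame(width, height):
    """Number of 16x16 MBs in a frame, rounding each dimension up."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def cycles_per_mb(clock_hz, width, height, fps=30):
    """Average clock cycles available per MB (integer floor)."""
    return clock_hz // (macroblocks_per_frame(width, height) * fps)


if __name__ == "__main__":
    # Frame sizes are assumptions; clock rates are those quoted in the text.
    cases = [
        ("CIF", 352, 288, [10_000_000]),
        ("D1", 720, 480, [30_000_000, 40_000_000, 60_000_000, 96_000_000]),
        ("HD720", 1280, 720, [72_000_000, 108_000_000]),
    ]
    for name, width, height, clocks in cases:
        for clock in clocks:
            budget = cycles_per_mb(clock, width, height)
            print(f"{name} @ {clock // 1_000_000} MHz: {budget} cycles/MB")
```

Under these assumptions, HD720 at 108MHz leaves an average of 1000 cycles per MB. Such budgets can be compared against the per-stage cycle counts reported in Section F to see why lower clock rates require the lighter quality modes.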
At its maximum performance, the proposed design encodes HD1080 video@20fps when operated at 108MHz in QS3 mode. Fig. 31 also compares the proposed design with the existing H.264 video encoders [3, 5, 7, 8]. The designs [3, 5, 7] support the H.264 baseline profile video coding tools. The design [8] supports both the H.264 baseline profile and part of the high profile coding tools, such as Context Adaptive Binary Arithmetic Coding (CABAC), 8x8 blocks, and B-frame coding, and is targeted at HDTV applications. By adopting simplified algorithms for IME and FME, the design [8] supports HD1080 video encoding with moderate hardware cost and acceptable video quality. The chip micrograph is shown in Fig. 32. Besides dynamic quality adjustability, the proposed design offers low hardware cost, with reductions of about 6%~49% in gate count and 21%~61% in internal memory. When operating at 10MHz for CIF video encoding, its power consumption is only 7mW, which is comparable to the 5mW reported for the state-of-the-art MPEG-4 encoder [4] for CIF video. Moreover, with its quality-adjustable flexibility, this design can be applied to power-adaptive video coding applications by dynamically trading off video quality and power consumption.

Fig. 30. Quality modes in the proposed design.

Fig. 31. Comparison with the state-of-the-art H.264 video encoders.

Fig. 32. Chip micrograph and specification.

V. CONCLUSION AND FUTURE WORKS

In this paper, we have presented a dynamic quality-adjustable H.264 video encoder for both high definition and portable video applications. By exploiting the proposed parameterized algorithms for motion estimation and intra coding, the proposed design can dynamically configure its encoding modes, trading off power consumption against video quality for various video encoding applications.
Using the TSMC 0.13µm 1P8M CMOS technology, the proposed design costs 470Kgates/13.3Kbytes SRAM and achieves H.264 encoding on and HD720 with different quality modes. The proposed design is much more flexible than the existing H.264 video encoders owing to its dynamic quality adjustability. The proposed design can be implemented in a more advanced technology such as 90nm CMOS to achieve the real-time encoding on video.

ACKNOWLEDGEMENT

The authors express their immense gratitude to the National Science Council (NSC) and the Chip Implementation Center (CIC) of Taiwan for the financial budget support under grant: NSC E and the chip fabrication support, respectively.

REFERENCES

[1] Advanced Video Coding, ISO/IEC and ITU-T Rec. H.264.
[2] I. E. G. Richardson, H.264 and MPEG-4 Video Compression - Video Coding for Next-generation Multimedia, John Wiley & Sons Inc.
[3] Y. W. Huang, T. C. Chen, C. H. Tsai, C. Y. Chen, T. W. Chen, C. S. Chen, C. Fu. Shen, S. Y. Ma, T. C. Wang, B. Yu. Hsieh, H. C. Fang, L. G. Chen, "A 1.3TOPS H.264/AVC single-chip encoder for HDTV applications," Proc. IEEE International Solid-State Circuits Conference, February.
[4] C. P. Lin, P. C. Tseng, Y. T. Chiu, S. S. Lin, C. C. Cheng, H. C. Fang, W. M. Chao, L. G. Chen, "A 5mW MPEG4 SP encoder with 2D bandwidth-sharing motion estimation for mobile applications," Proc. IEEE International Solid-State Circuits Conference, February.
[5] T. C. Chen, Y. H. Chen, C. Y. Tsai, S. F. Tsai, S. Y. Chien, L. G. Chen, "2.8 to 67.2mW Low-Power and Power-Aware H.264 Encoder for Mobile Applications," Proc. IEEE International Symposium on VLSI Circuits, June.
[6] "H.264 software IP suite for DSP C64xx" from ATEME.
[7] MM5010: "A Fully Hardwired H264 Encoder IP" from MMChips, product_mm5010.html.
[8] Y. K. Lin, D. W. Li, C. C. Lin, T. Y. Kuo, S. J. Wu, W. C. Tai, W. C. Chang, and T. S. Chang, "A 242mW 10mm p H.264/AVC High-Profile Encoder Chip," Proc. IEEE International Solid-State Circuits Conference, February.
[9] Z. Liu, Y. Song, M. Shao, S. Li, L. Li, S. Ishiwata, M. Nakagawa, S. Goto, T. Ikenaga, "A 1.41W H.264/AVC Real-Time Encoder SOC for HDTV1080P," Proc. IEEE International Symposium on VLSI Circuits, June.
[10] S. Mochizuki, T. Shibayama, M. Hase, F. Izuhara, K. Akie, M. Nobori, R. Imaoka, H. Ueda, K. Ishikawa, and H. Watanabe, "A 64 mW High Picture Quality H.264/MPEG-4 Video Codec IP for HD Mobile Applications in 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, no. 11, November.
[11] C. C. Lin, J. I. Guo, H. C. Chang, Y. C. Yang, J. W. Chen, M. C. Tsai, J. S. Wang, and J. I. Guo, "A 160kGate 4.5kB SRAM H.264 Video Decoder for HDTV Applications," Proc. IEEE International Solid-State Circuits Conference, February.
[12] Joint Video Team Reference Software JM9.3, suehring/tml/doc/, June.
[13] S. Y. Yap and J. V. McCanny, "A VLSI architecture for variable block size video motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 51, no. 7, July 2004.
[14] S. Y. Yap and J. V. McCanny, "A VLSI architecture for advanced video coding motion estimation," Proc. IEEE International Application-specific Systems, Architectures, and Processors Conference, June.
[15] S. Lopez, F. Tobajas, A. Villar, V. de Armas, J. F. Lopez, and R. Sarmiento, "Low cost efficient architecture for H.264 motion estimation," Proc. IEEE International Symposium on Circuits and Systems, vol. 1, May.
[16] M. Kim, I. Hwang, and S. I. Chae, "A fast VLSI Architecture for Full-Search Variable Block Size Motion Estimation in MPEG-4/H.264," Proc. Asia and South Pacific Design Automation Conference, vol. 1, January.
[17] Y. W. Huang, T. C. Wang, B. Y. Hsieh, and L. G. Chen, "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264," Proc. IEEE International Symposium on Circuits and Systems, vol. 2, May.
[18] T. Komarek and P. Pirsch, "Array architectures for block matching algorithms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 10, October.
[19] L. de Vos and M. Stegherr, "Parameterizable VLSI architectures for the full-search block-matching algorithm," IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 10, October.
[20] H. M. Jong, L. G. Chen, and T. D. Chiueh, "Parallel Architectures for 3-Step Hierarchical Search Block-Matching Algorithm," IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 4, August.
[21] S. Hamalainen, L. Koskinen, and K. Halonen, "A hardware-based predictive motion estimation algorithm," Proc. IEEE International Symposium on Circuits and Systems, vol. 6, July.
[22] K. B. Lee, H. Y. Chin, H. C. Hsu, and C. W. Jen, "QME: an efficient subsampling-based block matching algorithm for motion estimation," Proc. IEEE International Symposium on Circuits and Systems, vol. 2, pp. II-305-8, May.
[23] H. Y. Chin, C. C. Cheng, Y. K. Lin, and T. S. Chang, "A bandwidth efficient subsampling-based block matching architecture for motion estimation," Proc. Asia and South Pacific Design Automation Conference, vol. 2, pp. D/7-D/8, January.
[24] L. Fanucci, S. Saponara, and L. Bertini, "A parametric VLSI architecture for video motion estimation," Integration, the VLSI Journal, vol. 31, no. 1, pp (22), November.
[25] C. L. Su, Y. C. Yang, C. W. Chen, W. S. Yang, Y. L. Chen, S. Y. Tseng, and J. I. Guo, "A Low Complexity High Quality Integer Motion Estimation Architecture Design for H.264/AVC," Proc. IEEE Asia-Pacific Conference on Circuits and Systems, December.
[26] T. C. Chen, Y. W. Huang, L. G. Chen, "Fully Utilized and Reusable Architecture for Fractional Motion Estimation of H.264/AVC," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, May.
[27] C. Yang and S. Goto, "High Performance VLSI Architecture of Fractional Motion Estimation in H.264 for HDTV," Proc. IEEE International Symposium on Circuits and Systems, May.
[28] Y. Y. Wang and C. J. Tsai, "An Efficient Dual-interpolator Architecture for Sub-pixel Motion Estimation," Proc. IEEE International Symposium on Circuits and Systems, May.
[29] T. Y. Kuo, Y. K. Lin, and T. S. Chang, "SIFME: Single Iteration Fractional-pel Motion Estimation Algorithm and Architecture for HDTV Sized H.264 Video Coding," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, April.
[30] C. C. Cheng, C. W. Ku, and T. S. Chang, "A 1280x720 Pixels 30 Frames/s H.264/MPEG-4 AVC Intra Encoder," Proc. IEEE International Symposium on Circuits and Systems, May 2006.
[31] Y. W. Huang, B. Y. Hsieh, T. C. Chen, and L. G. Chen, "Analysis, fast algorithm, and VLSI architecture design for H.264/AVC intra frame coder," IEEE Transactions on Circuits and Systems for Video Technology, March.
[32] D. W. Li, C. W. Ku, C. C. Cheng, Y. K. Lin, and T. S. Chang, "A 61MHz 72K Gates 1280x720 30fps H.264 Intra Encoder," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. II-801-II-804, April.
[33] J. W. Chen, C. H. Chang, C. C. Lin, Y. H. Ou Yang, J. I. Guo, and J. S. Wang, "A Condition-based Intra Prediction Algorithm for H.264/AVC," Proc. IEEE International Conference on Multimedia & Expo, July.
[34] C. H. Chang, J. W. Chen, H. C. Chang, Y. C. Yang, J. S. Wang, J. I. Guo, "A Quality Scalable H.264/AVC Baseline Intra Encoder for High Definition Video Applications," Proc. IEEE Workshop on Signal Processing Systems, October.
[35] X. Yi and N. Ling, "Rate control using enhanced frame complexity measure for H.264 video," Proc. IEEE Workshop on Signal Processing Systems, October 2004.

[36] M. Jiang, X. Yi, N. Ling, "Improved frame-layer rate control for H.264 using MAD ratio," Proc. IEEE International Symposium on Circuits and Systems, vol. 3, pp. III-813-6, May 2004.
[37] H. Yu, Z. Lin, and F. Pan, "An improved rate control algorithm for H.264," Proc. IEEE International Symposium on Circuits and Systems, vol. 1, May.
[38] S. Su, S. Yu, and J. Zhou, "An improved Basic-Unit Layer Rate-Control Scheme on H.264," Proc. Sixth International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 815-819, December 2005.
[39] Open Verification Library (OVL) in

Bing-Tsung Wu was born in Penghu, Taiwan, R.O.C. He received the B.S. degree from the Department of Information Management and the M.S. degree from the Graduate School of Business and Operations Management, Chang Jung Christian University, Tainan, Taiwan, in 2003 and 2005, respectively. He is currently working toward the Ph.D. degree at the Graduate Institute of Computer Science and Information Engineering, National Chung Cheng University. His research interests include video processing, VLSI architectures, digital IP design, and video rate control.

Hsiu-Cheng Chang was born in Tainan, Taiwan, R.O.C. He received the B.S. and M.S. degrees from the Department of Computer Science and Information Engineering, National Chung Cheng University, Chia Yi, Taiwan, in 2003 and 2005, respectively. He is currently working toward the Ph.D. degree at the Graduate Institute of Computer Science and Information Engineering, National Chung Cheng University. His research interests include video processing, VLSI architectures, digital IP design, and multimedia SOC design.

Jia-Wei Chen received the B.S. degree in electronics engineering from National Lien Ho Institute of Technology, Miao-Li, Taiwan, in 2003, and the M.S. degree in electrical engineering from National Chung Cheng University, Chia-Yi, Taiwan. He is currently working toward the Ph.D. degree in electrical engineering at National Chung Cheng University, Chia-Yi, Taiwan. His research interests include video processing, very large scale integration architecture design, digital IP design, and silicon-on-chip design.

Ching-Lung Su was born in Taipei, Taiwan, R.O.C. He received the B.S. degree from the Department of Electrical Engineering, Chinese Culture University, Taipei, Taiwan, the M.S. degree from the Graduate Institute of Electronics and Computer Science Engineering, National Yunlin University of Science & Technology, Yunlin, Taiwan, and the Ph.D. degree from the Graduate Institute of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, in 1994, 1996, and 2003, respectively. In 2004, he joined the Department of Electronics Engineering, National Yunlin University of Science & Technology, as an assistant professor. From 2007 to 2008, he was on leave from National Yunlin University of Science & Technology and served as the Technical Deputy Director of the Processor and Application Division, SoC Technology Center (STC), Industrial Technology Research Institute (ITRI), Hsinchu, Taiwan. Since 2008, he has also been a consultant of STC/ITRI. His research interests include embedded software for SoC, video signal processing, digital IC architecture design, multi-core embedded systems, and multimedia digital signal processor design.

Jinn-Shyan Wang (S'85-M'88) was born in Taiwan, R.O.C. He received the B.S. degree in electrical engineering from the National Cheng-Kung University, Tainan, Taiwan, in 1982 and the M.S. and Ph.D. degrees from the Institute of Electronics, National Chiao-Tung University, Hsinchu, Taiwan, in 1984 and 1988, respectively. He was with the Industrial Technology Research Institute (ITRI) from , engaged in ASIC circuit and system design, and became the Manager of the Department of VLSI Design. He joined the Department of Electrical Engineering, National Chung-Cheng University, Chia-Yi, Taiwan, in 1995, where he is currently a full Professor. His research interests are in low-power and high-speed digital integrated circuits and systems, analog integrated circuits, IP and SOC design, and CMOS image sensors. He has published over 20 journal papers and 40 conference papers and holds over 20 patents on VLSI circuits and architectures.

Jiun-In Guo was born in Kaohsiung, Taiwan, R.O.C. He received the B.S. and Ph.D. degrees in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1989 and 1993, respectively. He is currently a Professor of the Department of Computer Science and Information Engineering, National Chung-Cheng University, Chiayi, Taiwan. He is now the research distinguished professor of National Chung-Cheng University from 2008 to . He joined the System-on-Chip Research Center in March 2003 and became involved in several Grand Research Projects on low-power, high-performance processor design and multimedia IP/SOC design. He was the director of the SOC Research Center, National Chung-Cheng University, from 2005 to . He was an Associate Professor of the Department of Computer Science and Information Engineering, National Chung-Cheng University, from 2001 to 2003 and an Associate Professor of the Department of Electronics Engineering, National Lien-Ho Institute of Technology, from 1994 to . He was also the director of the Department of Electronics Engineering, National Lien-Ho Institute of Technology, from 1996 to . Dr. Guo was the recipient of the National Science Council (NSC) Research Award in 1996 and . He was the recipient of the 2003 MXIC Young Professor Award for his contributions to the course of low-power Multimedia/DSP Silicon IP Design. He was also the recipient of the 2004 Chinese Institute of Electrical Engineering (CIEE) Outstanding Youth Electrical Engineer Award and the 2008 Chinese Institute of Electrical Engineering (CIEE) Tai-Chung Section Outstanding Engineering Professor Award, in recognition of his excellent contributions to R&D and service in electrical engineering. He was also the recipient of the 2006 Outstanding Research Award of National Chung Cheng University. He has published over 120 technical papers in the research areas of low-power and low-cost algorithm and architecture design for DSP/multimedia signal processing applications. His research team has won over 25 IC-related student design contest awards from 2003 to . His research interests include image, multimedia, and digital signal processing, VLSI algorithm/architecture design, digital SIP design, and SOC design.
