High-Performance VLSI Architecture of H.264/AVC CAVLD by Parallel Run_before Estimation Algorithm


JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 29, 595-605 (2013)

High-Performance VLSI Architecture of H.264/AVC CAVLD by Parallel Run_before Estimation Algorithm *

JONGWOO BAE 1 AND JINSOO CHO 2,+
1 Department of Information and Communication Engineering, Myongji University, Gyeonggi-Do, 449-728 Korea
2 Department of Computer Engineering, Gachon University, Gyeonggi-Do, 461-701 Korea

A high-performance VLSI architecture for the H.264/AVC context-adaptive variable-length decoder (CAVLD) is proposed in order to reduce the computation time. The overall computation is pipelined, and parallel processing is employed for high performance. For the run_before computation, the values of input symbols are estimated in parallel to check in advance whether their computation can be skipped. Experimental results show that the performance of run_before is improved by 134% on average when four symbols are estimated in parallel, while the area of the VLSI implementation increases by only 12% compared to a sequential method. The degree of parallelism employed for the estimation module is 4, and it can be changed easily. H.264/AVC is an essential technology for the multimedia engines of many consumer electronics applications, such as D-TVs and mobile devices. The proposed method contributes to the performance improvement of those applications.

Keywords: H.264/AVC, CAVLD, run_before, skip estimation, VLSI design

1. INTRODUCTION

H.264/AVC has replaced MPEG-2, and it is widely used for terrestrial, cable, and internet broadcasting in many countries. The frame rates and resolutions for TVs and mobile devices have been increasing rapidly, and interest in 3D movies has also developed recently. These factors have led to an explosive growth in the required amount of video data. Therefore, the performance of H.264/AVC must improve to meet the new requirements. Due to the sequential bottleneck of the entropy decoder, it is hard to improve the performance of H.264/AVC and other video codecs.
Especially for real-time applications with a low-latency requirement, sufficient buffering is not guaranteed, and the bit rates can escalate sharply, which may cause the decoding to fail. H.264/AVC CAVLD consists of five steps: coeff_token, trailing_ones, level, total_zeroes, and run_before [1]. Among these steps, the level and run_before computations are the most time-consuming. In general, if the number of levels in a macroblock is large, so is the number of run_befores; they are proportionally related. For the encoding of complex video scenes, both the computational load and the bit rate of the encoded stream increase. This leads to an increased number of macroblocks with a large number of levels and run_befores. In most applications, the decoder can tolerate a temporary increase of computational load by buffering if the additional stress does not continue for too long. However, for real-time applications with a low-latency condition, the number of levels in a macroblock increases, as does the number of run_before values, and the sudden escalation of the computational load is difficult for the decoder to handle. In order to overcome such problems, this paper proposes a method to improve the performance of CAVLD for H.264/AVC.

Various methods have been presented for the computation of CAVLD. A method to modify the arithmetic computations for run_before was proposed by Chang [2]. He observed the regularity in the run_before look-up table (LUT) and changed the look-up table into four-step arithmetic calculations, which reduced computation time and power consumption. However, this method is basically sequential, and its computation time is much longer than that of parallel methods. Nikara [3] proposed a parallel architecture for multi-symbol variable-length decoding. An advanced memory reference technique was employed, and the on-chip memory was reduced. The architecture offers performance proportional to the degree of parallelism provided; unfortunately, the price to pay for the increased parallelism is very high, as the area of run_before increases substantially. For this reason, a design with a high degree of parallelism seems impracticable. The method of skipping run_before symbols with a value of zero was proposed by Tsai [4]. However, this method is also sequential, and the detailed architecture and experimental results were not clearly provided. Many research studies have been conducted on H.264/AVC recently [5-10]. Herein, we introduce a new algorithm and VLSI architecture to reduce the run_before computation time drastically at the cost of a small area addition.

Received June 20, 2012; accepted August 16, 2012. Communicated by Cho-Li Wang.
* This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0025027).
+ Corresponding author.
The first four steps of the CAVLD are implemented by employing the best-known techniques so far, including parallel processing and pipelining, which are introduced briefly below. We focus on the improvement of the final step, the run_before computation. Since run_before is one of the most time-consuming steps of CAVLD, our work is an important contribution toward the improvement of CAVLD. This paper is organized as follows. In Section 2, the related works are explained. In Section 3, the proposed architecture is explained in detail. In Section 4, experimental results are provided. Section 5 concludes this paper.

2. RELATED WORKS

The flowchart of CAVLD for the hardware implementation is shown in Fig. 1. It consists of five computation steps: coeff_token, trailing_ones, level, total_zeroes, and run_before. Each step is generally executed sequentially.

Step 1: Coeff_token
The numbers of the total coefficients and the trailing ones are computed first, according to the look-up table and algorithm in [1]. This step can be executed in parallel with the computation of the previous macroblock, thereby saving computation time.

Step 2: Trailing ones
The trailing ones are computed next. Each trailing one is either +1 or -1, depending on whether the corresponding input bit is 0 or 1. The trailing ones can be computed immediately by a simple logic design.

Fig. 1. Flowchart of CAVLD.

Step 3: Level
The remaining non-zero symbols are decoded next. The level value is computed from the level prefix, level suffix, and suffix length. The level computation, together with run_before, is the most time-consuming step. It is possible to compute two level symbols in parallel according to [5], which doubles the computation speed. The level decoding of one symbol takes more than one cycle for real-time HD video decoding. We can speed up the computation by pipelining the level computation into two stages, in addition to the parallel computation.

Step 4: Total zeroes
The number of total zeroes is computed next, according to the look-up table in [1]. The computation of total zeroes can be overlapped with the final stage of the level computation, thereby saving computation time.

Step 5: Run_before
The final step of the CAVLD is run_before. It is the most time-consuming step except for the level computation. The run_before computation can be implemented in parallel. Various methods have been developed for this step, and we introduce a new method in this paper.

The data read and write procedure for CAVLD can be overlapped with the above five steps to further save computation time. By combining all the methods explained above, we can implement a very fast CAVLD with a small size. In this paper, we focus on a new method for run_before, employing a parallel estimation, to further improve the performance of CAVLD.

In a recent work [6], we introduced a run_before estimation method that works as follows. The run_before values of the next symbols are estimated in parallel with the computation for the current symbol of a macroblock. Fig. 2 shows the architecture of the method. The block labeled run_before core computes the run_before of the first input symbol. The estimation module, shown as the dotted box, computes the estimation for the next symbols with various starting bit positions. If we compute the run_before value of the next symbol together with the current symbol, there are two uncertainty factors: the starting bit position and the zeroes left value of the next symbol, because the computation of the first symbol is not yet finished. Note that zeroes left is defined as the number of remaining zeroes to insert after the current level is decoded [1]. In order to consider every possible combination of input symbols, the design of a parallel run_before estimation module is very complicated, albeit much simpler than other parallel designs of a general run_before computation, such as Nikara's architecture [3]. A real parallel run_before computation has multiple LUTs for the various starting bit positions and zeroes left values of the next symbols. This leads to a large design with a long delay due to the many LUTs and multiplexor layers. The large parallel-computation module is replaced with a small estimation module.

Fig. 2. Run_before with parallel estimation in [6].

3. PROPOSED ARCHITECTURE

In this paper, we further improve the performance of run_before by estimating the run_before values of the input symbols in parallel while the estimated run values are zero. If the estimation fails, the real run value is computed by another module.

We only check whether the run_before of the next symbol is zero. If it is zero, we set the run_before value to zero and move to the next symbol. We repeat this until a non-zero run_before value is found, then stop. The non-zero run_before value is computed next. The check for a run_before of zero is simple, because the input symbol is either 1, 11, or 111, depending on the zeroes left, as shown in Fig. 3. The symbol is 1 if the zeroes left value is 1 or 2, 11 if the zeroes left value is from 3 to 6, and 111 if the zeroes left value is greater than 6. Fig. 4 shows the flowchart of the proposed method.

run_before   zeroes left:   1    2    3    4    5    6    >6
 0                          1    1    11   11   11   11   111
 1                          0    01   10   10   10   000  110
 2                          -    00   01   01   011  001  101
 3                          -    -    00   001  010  011  100
 4                          -    -    -    000  001  010  011
 5                          -    -    -    -    000  101  010
 6                          -    -    -    -    -    100  001
 7                          -    -    -    -    -    -    0001
 8                          -    -    -    -    -    -    00001
 9                          -    -    -    -    -    -    000001
10                          -    -    -    -    -    -    0000001
11                          -    -    -    -    -    -    00000001
12                          -    -    -    -    -    -    000000001
13                          -    -    -    -    -    -    0000000001
14                          -    -    -    -    -    -    00000000001

Fig. 3. Run_before decoding table.

The operation of the flowchart is explained as follows. There are three cases, corresponding to the three columns of the flowchart. Each case consumes a different number of bits per symbol, so the estimation of the second symbol starts at a different bit (bit-13 for Case 1, bit-12 for Case 2, and bit-11 for Case 3). The size of the input bitstream register, implemented with a barrel shifter, is assumed to be 15 bits, with bit-14 as the most significant bit. The run_before core module does not execute before the estimation module: the estimation is performed until no skip is left, and the run_before core module then works on the non-skipped run. This way, we eliminate the wasted cycle that the run_before core would consume if it ran at the front (before the estimation). The run_before core does not support parallel execution, and it consumes one cycle for each run symbol, whether it is a skip or not.
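As a software cross-check of the decoding table in Fig. 3, run_before decoding can be sketched as follows. The codewords are reconstructed from the H.264/AVC run_before VLC table (the extracted figure is partially garbled, so they should be verified against the standard [1]), and `decode_run_before` is an illustrative name, not a function from the paper.

```python
# Software model of the run_before decoding table (Fig. 3).
# Codewords reconstructed from the H.264/AVC run_before VLC table;
# the ">6" column is stored under key 7. Verify against [1].
RUN_BEFORE_TABLE = {
    1: {"1": 0, "0": 1},
    2: {"1": 0, "01": 1, "00": 2},
    3: {"11": 0, "10": 1, "01": 2, "00": 3},
    4: {"11": 0, "10": 1, "01": 2, "001": 3, "000": 4},
    5: {"11": 0, "10": 1, "011": 2, "010": 3, "001": 4, "000": 5},
    6: {"11": 0, "000": 1, "001": 2, "011": 3, "010": 4, "101": 5, "100": 6},
    7: {"111": 0, "110": 1, "101": 2, "100": 3, "011": 4, "010": 5, "001": 6},
}
# Rows 7-14 of the ">6" column follow the pattern 0001, 00001, ..., 00000000001.
for r in range(7, 15):
    RUN_BEFORE_TABLE[7]["0" * (r - 4) + "1"] = r

def decode_run_before(bits, pos, zeros_left):
    """Decode one run_before symbol from a bit string starting at pos.

    Returns (run_before, new_pos)."""
    table = RUN_BEFORE_TABLE[min(zeros_left, 7)]
    for n in range(1, 12):            # the longest codeword is 11 bits
        code = bits[pos:pos + n]
        if code in table:
            return table[code], pos + n
    raise ValueError("invalid run_before code")
```

For example, with a zeroes left value greater than 6, the input 111 decodes to a run of zero, which is exactly the skip condition the estimation module tests.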
In addition, the whole architecture of the estimation becomes much simpler than the previous one, because the uncertainty of the zeroes left value caused by running the run_before core module beforehand is removed. That is, the estimation module depends only on the given zeroes left value, and the estimation always starts at the first bit position, bit-14. The following cases provide a more detailed explanation.
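Before the case analysis, the zero-run check (a zero run_before is coded as 1, 11, or 111 depending on the zeroes left value) and the sequential skip loop of Fig. 4 can be modeled as a minimal software sketch; the function names are illustrative, not the paper's.

```python
def zero_run_width(zeros_left):
    # A zero run_before is coded as 1, 11, or 111 depending on zeroes left.
    if zeros_left <= 2:
        return 1
    if zeros_left <= 6:
        return 2
    return 3

def count_zero_runs(bits, pos, zeros_left, max_count):
    """Sequential model of the estimation loop: skip while the run is zero.

    Returns (skipped_symbols, new_pos)."""
    width = zero_run_width(zeros_left)
    count = 0
    # A zero run inserts no zeroes, so zeroes left (and hence the
    # symbol width) stays fixed for the duration of the loop.
    while count < max_count and bits[pos:pos + width] == "1" * width:
        pos += width
        count += 1
    return count, pos
```

The loop-invariant width is the key observation: because a zero run leaves the zeroes left value unchanged, the estimation depends only on the zeroes left value supplied by the total zeroes step, as stated above.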

Fig. 4. Flowchart of the proposed run_before estimation.

Case 1: The leftmost column of the flowchart in Fig. 4 shows the case where the zeroes left value is 1 or 2. The estimation starts at bit-14. According to the run_before decoding table in Fig. 3, the number of used bits is 1 only when the zeroes left value is 1 or 2. If the input bit is 1, the run_before value is zero, the estimation result is a skip, and the estimation of the next symbol continues. If the input bit is 0, the estimation stops, and the symbol starting with bit 0 is immediately computed by the run_before core module. In the flowchart in Fig. 4, the estimation of multiple symbols is shown as a sequential loop: when the estimation result is a skip and the count is less than MAX, it goes back to the next input bit comparison. For a parallel implementation, using the estimation of 4 symbols as an example, we check whether the input bits are 1, 11, 111, or 1111 at the same time. At the first position where the comparison result is false, the number of used input bits and the number of computed symbols up to that position are returned. For example, if the comparison result is first false at the position of 111, there are two used bits and two computed symbols. If all the comparison results are true, 4 symbols are estimated to have run_before values of zero, and there are 4 used bits and 4 computed symbols. This case is shown as block 1 of the estimation module in Fig. 5, and the detailed architecture of block 1 is shown in Fig. 6.

Fig. 5. Proposed run_before architecture.
Fig. 6. Architecture of block 1.

Case 2: The middle column of the flowchart in Fig. 4 shows the case where the zeroes left value is from 3 to 6. The estimation starts at bit-14. According to the run_before decoding table in Fig. 3, the number of used bits is 2 only when the zeroes left value is from 3 to 6. If the input bits are 11, the run_before value is zero, the estimation result is a skip, and the estimation of the next symbol continues. If the input bits are other than 11, the estimation stops, and the symbol is immediately computed by the run_before core module. In the flowchart in Fig. 4, the estimation of multiple symbols is shown as a sequential loop: when the estimation result is a skip and the count is less than MAX, it goes back to the next input bit comparison. Note that the number of used bits per symbol is 2 in this case. For a parallel implementation, using the estimation of 4 symbols as an example, we check whether the input bits are 11, 1111, 111111, or 11111111 at the same time. At the first position where the comparison result is false, the number of used input bits and the number of computed symbols up to that position are returned. This case is shown as block 2 in Fig. 5. The detailed architecture of block 2 is similar to that of block 1.
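A four-way version of this simultaneous comparison (one estimation block) can be sketched as follows, with the symbol width of 1, 2, or 3 bits selecting among blocks 1-3. This is an illustrative software model of the comparator behavior, not the paper's RTL.

```python
def parallel_skip_estimate(bits, pos, symbol_width, parallelism=4):
    """Model of one estimation block: compare the all-ones patterns of
    1, 2, ..., parallelism symbols simultaneously.

    Returns (estimated_symbols, used_bits) for the longest match."""
    # In hardware the comparisons run concurrently; here we scan the
    # candidates from longest to shortest and keep the first hit.
    for k in range(parallelism, 0, -1):
        n = k * symbol_width
        if bits[pos:pos + n] == "1" * n:
            return k, n
    return 0, 0
```

For Case 2 (width 2), an input starting with 111101 matches 11 and 1111 but not 111111, so two symbols are estimated as skips and four bits are consumed.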

Case 3: This case shows the most drastic improvement of the proposed algorithm over the previous one. In the previous method, we needed to check 9 potential starting bit positions, and therefore 9 estimation blocks were needed (we also showed that this could be avoided by first checking for the value 000). The proposed algorithm does not require checking the 9 potential bit positions, because it has no preceding run_before core module, as shown in Fig. 5. It depends only on the zeroes left value from the total zeroes step, so Case 3 is now similar to the other two cases. The rightmost column of the flowchart in Fig. 4 shows the case where the zeroes left value is greater than 6. The estimation starts at bit-14. According to the run_before decoding table in Fig. 3, the number of used bits is 3 only when the zeroes left value is greater than 6. If the input bits are 111, the run_before value is zero, the estimation result is a skip, and the estimation of the next symbol continues. If the input bits are other than 111, the estimation stops, and the symbol is immediately computed by the run_before core module. In the flowchart in Fig. 4, the estimation of multiple symbols is shown as a sequential loop: when the estimation result is a skip and the count is less than MAX, it goes back to the next input bit comparison. Note that the number of used bits per symbol is 3 in this case. For a parallel implementation, using the estimation of 4 symbols as an example, we check whether the input bits are 111, 111111, 111111111, or 111111111111 at the same time. At the first position where the comparison result is false, the number of used input bits and the number of computed symbols up to that position are returned. This case is shown as block 3 in Fig. 5. The detailed architecture of block 3 is similar to those of blocks 1 and 2. Unlike [6], we do not check whether the input bits of the run_before core module equal 000.
In [6], since the run_before core module was located in front of the estimation module, we had to handle all the complications for zeroes left values greater than 6. Now that the run_before core module is not at the front, we no longer have to worry about the case starting with bits 000. The previous design lost some performance by skipping the estimation when the input bits of the run_before core module were 000. In the proposed algorithm, however, we improve performance with a simpler architecture, without skipping on 000.

4. EXPERIMENTAL RESULTS

To assess the performance of the proposed algorithm and VLSI architecture, we implemented it in Verilog HDL and synthesized it with the TSMC LVT90 process. The result was compared with four previous designs: a sequential architecture, a parallel architecture, estimation architecture-1, and estimation architecture-2. Performance and size are the major comparison factors.

The sequential architecture is a direct implementation of the official run_before specification in [1]. It consists of the run_before core only; therefore, its size is very small. We use the size and performance of the sequential design as the reference for comparison.

The parallel architecture of [3] is shown in Fig. 7. It has multiple run_before cores, proportional to the degree of parallelism. Due to the various starting points, multiple LUTs are used for each run_before core. The performance of this method is five times that of the sequential one for a degree of parallelism of 4. However, the resulting size is very large (3640% larger than the sequential one).

Estimation architecture-1 of [6] is our previous estimation architecture with a third estimation module that estimates all 9 possible starting points in block 3, as shown in Fig. 8. The performance of this method is 2.34 times that of the sequential one for a degree of parallelism of 4. It is much smaller than the complete parallel architecture, but still larger than estimation architecture-2 (53% larger than the sequential one).

Fig. 7. Parallel run_before architecture [3].
Fig. 8. Estimation-1 architecture for run_before [6].

Estimation architecture-2 of [6] is the architecture shown in Fig. 2. It is slightly (9.8%) larger than the proposed architecture because of its more complicated estimation logic. Its performance is worse than that of the proposed one by 39.3%, because the preceding run_before core always consumes one cycle before the estimation module. In addition, the estimation is skipped when the run_before core has an input starting with 000.

The proposed architecture is the smallest except for the sequential one, and it yields the best performance after the completely parallel architecture. Considering the slow performance of the sequential architecture and the extremely large size of the complete parallel architecture, the proposed architecture is the best choice in terms of both performance and size (134% faster and 12% larger than the sequential one for a degree of parallelism of 4). Table 1 summarizes the comparison of the five architectures in terms of size and performance relative to the sequential method. The degree of parallelism employed for the estimation module is 4, and it can be changed easily.

Table 1. Comparison of the 5 architectures.

Methods            Size   Performance
Sequential [1]     1      1
Parallel [3]       37.4   5
Estimation-1 [6]   1.53   2.34
Estimation-2 [6]   1.23   1.68
Proposed           1.12   2.34

5. CONCLUSION

In this paper, we presented a high-performance algorithm and VLSI architecture to improve the run_before computation of an H.264/AVC CAVLD module. The proposed method employs the parallel estimation of skipped symbols until the skipping fails; then, the real run value is computed. Using this method, for a degree of parallelism of 4, we improved the performance by 134% while the area of run_before increased by only 12% over the sequential implementation, due to the added estimation module. Various methods were compared with the proposed one, and the experimental results showed that the proposed method is the best choice considering both performance and size. The proposed method can be used for a number of modern consumer electronics applications, such as D-TVs and mobile devices, that require very high-performance H.264/AVC.

REFERENCES

1. Joint Video Team, "Draft ITU-T recommendation and final draft international standard of Joint Video Specification," ITU-T Rec. H.264/ISO/IEC 14496-10 AVC, 2003.
2. H. C. Chang, C. C. Lin, and J. I. Guo, "A novel low-cost high-performance VLSI architecture for MPEG-4 AVC/H.264 CAVLC decoding," in Proceedings of IEEE International Symposium on Circuits and Systems, Vol. 6, 2005, pp. 6110-6113.
3. J. Nikara, S. Vassiliadis, J. Takala, and P. Liuha, "Multiple-symbol parallel decoding for variable length codes," IEEE Transactions on VLSI Systems, Vol. 12, 2004, pp. 676-685.

4. T. H. Tsai, D. L. Fang, and Y. J. Pan, "A hybrid CAVLD architecture design with low complexity and low power considerations," in Proceedings of IEEE International Conference on Multimedia and Expo, 2007, pp. 1910-1913.
5. H. Lin, Y. Lu, B. Liu, and J. Yang, "A highly efficient VLSI architecture for H.264/AVC CAVLC decoder," IEEE Transactions on Multimedia, Vol. 10, 2008, pp. 31-42.
6. J. Bae, J. Cho, B. Kim, and J. Baek, "High performance VLSI design of run_before for H.264/AVC CAVLD," IEICE Electronics Express, Vol. 8, 2011, pp. 950-955.
7. J. Bae and J. Cho, "A decoupled architecture for multi-format decoder," IEICE Electronics Express, Vol. 5, 2008, pp. 705-710.
8. J. Bae, N. Park, and S. Lee, "Quarter-pel interpolation architecture in H.264/AVC decoder," in Proceedings of IPC, 2007, pp. 224-227.
9. M. Hase et al., "Development of low-power and real-time VC-1/H.264/MPEG-4 video processing hardware," in Proceedings of Design Automation Conference, 2007, pp. 637-643.
10. T. M. Liu et al., "A 125 uW, fully scalable MPEG-2 and H.264/AVC video decoder for mobile applications," IEEE Journal of Solid-State Circuits, Vol. 42, 2007, pp. 161-169.

Jongwoo Bae received a B.S. degree in Control and Instrumentation from Seoul National University, Korea, in 1988, and M.S. and Ph.D. degrees in Computer Engineering from the University of Southern California in 1989 and 1996, respectively. Dr. Bae worked as a hardware and software engineer for Actel, Avanti, and Pulsent Co. in San Jose, CA, from 1996 to 2002. He worked as a principal engineer in the Samsung Electronics System LSI Division from 2003 to 2007. He is currently an Associate Professor in the Department of Information and Communications Engineering, College of Engineering, Myongji University, Korea. His research interests include VLSI design, image/video processing, digital TV, and processors.

Jinsoo Cho received a B.S. degree in Electronic Engineering from Inha University, Korea, in 1994, an M.S. degree in Electrical Engineering from Columbia University in 1998, and a Ph.D. in Electrical and Computer Engineering from the Georgia Institute of Technology in 2003. From 2004 to 2006, Dr. Cho worked as a senior engineer on the D-TV Development Team of the System LSI Division at Samsung Electronics in Korea. He is currently an Associate Professor in the Department of Computer Engineering, College of Information Technology, Gachon University, Korea. His research interests include image/video processing, digital TV, image/video enhancement, and image/video compression.