A 4-way parallel CAVLC design for H.264/AVC 4 Kx2 K 60 fps encoder

Similar documents
FPGA based High Performance CAVLC Implementation for H.264 Video Coding

NEW CAVLC ENCODING ALGORITHM FOR LOSSLESS INTRA CODING IN H.264/AVC. Jin Heo, Seung-Hwan Kim, and Yo-Sung Ho

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC

Design of a High Speed CAVLC Encoder and Decoder with Parallel Data Path

Signal Processing: Image Communication

Architecture of High-throughput Context Adaptive Variable Length Coding Decoder in AVC/H.264

Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda

Design of Entropy Decoding Module in Dual-Mode Video Decoding Chip for H. 264 and AVS Based on SOPC Hong-Min Yang, Zhen-Lei Zhang, and Hai-Yan Kong

High-Performance VLSI Architecture of H.264/AVC CAVLD by Parallel Run_before Estimation Algorithm *

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased

An Efficient Table Prediction Scheme for CAVLC

A COMPARISON OF CABAC THROUGHPUT FOR HEVC/H.265 VS. AVC/H.264. Massachusetts Institute of Technology Texas Instruments

Reducing/eliminating visual artifacts in HEVC by the deblocking filter.

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION

H.264/AVC BASED NEAR LOSSLESS INTRA CODEC USING LINE-BASED PREDICTION AND MODIFIED CABAC. Jung-Ah Choi, Jin Heo, and Yo-Sung Ho

A comparison of CABAC throughput for HEVC/H.265 VS. AVC/H.264

Low power context adaptive variable length encoder in H.264

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding

Video Codecs. National Chiao Tung University Chun-Jen Tsai 1/5/2015

Video Coding Using Spatially Varying Transform

Fast frame memory access method for H.264/AVC

A High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-Core System

Adaptive Entropy Coder Design Based on the Statistics of Lossless Video Signal

THE latest video coding standard, H.264/advanced

Chapter 2 Joint MPEG-2 and H.264/AVC Decoder

AN ENHANCED CONTEXT-ADAPTIVE VARIABLE LENGTH CODING FOR H.264.

A full-pipelined 2-D IDCT/ IDST VLSI architecture with adaptive block-size for HEVC standard

A 1080p H.264/AVC Baseline Residual Encoder for a Fine-Grained Many-Core System Zhibin Xiao, Student Member, IEEE, and Bevan M. Baas, Member, IEEE

A Quantized Transform-Domain Motion Estimation Technique for H.264 Secondary SP-frames

AVC Entropy Coding for MPEG Reconfigurable Video Coding

Single-Port SRAM-Based Transpose Memory With Diagonal Data Mapping for Large Size 2-D DCT/IDCT

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation

A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation

H.264 / AVC (Advanced Video Coding)

Reconfigurable Variable Block Size Motion Estimation Architecture for Search Range Reduction Algorithm

LIST OF TABLES. Table 5.1 Specification of mapping of idx to cij for zig-zag scan 46. Table 5.2 Macroblock types 46

BANDWIDTH-EFFICIENT ENCODER FRAMEWORK FOR H.264/AVC SCALABLE EXTENSION. Yi-Hau Chen, Tzu-Der Chuang, Yu-Jen Chen, and Liang-Gee Chen

THE H.264, the newest hybrid video compression standard

Smart Bus Arbiter for QoS control in H.264 decoders

Real-time and smooth scalable video streaming system with bitstream extractor intellectual property implementation

H.264/AVC und MPEG-4 SVC - die nächsten Generationen der Videokompression

Lec 04 Variable Length Coding in JPEG

A Motion Vector Predictor Architecture for AVS and MPEG-2 HDTV Decoder

DUE to the high computational complexity and real-time

VHDL Implementation of H.264 Video Coding Standard

IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC

High Efficiency Video Decoding on Multicore Processor

Yibo Fan, Leilei Huang, Yufeng Bai, and Xiaoyang Zeng, Member, IEEE

Performance Comparison between DWT-based and DCT-based Encoders

High-Throughput Parallel Architecture for H.265/HEVC Deblocking Filter *

High-Throughput Entropy Encoder Architecture for H.264/AVC

EE Low Complexity H.264 encoder for mobile applications

Advanced Video Coding: The new H.264 video compression standard

Reduced 4x4 Block Intra Prediction Modes using Directional Similarity in H.264/AVC

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding

OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD

Design and Implementation of 3-D DWT for Video Processing Applications

Title page to be provided by ITU-T ISO/IEC TABLE OF CONTENTS

Xin-Fu Wang et al.: Performance Comparison of AVS and H.264/AVC 311 prediction mode and four directional prediction modes are shown in Fig.1. Intra ch

An HEVC Fractional Interpolation Hardware Using Memory Based Constant Multiplication

A LOSSLESS INDEX CODING ALGORITHM AND VLSI DESIGN FOR VECTOR QUANTIZATION

SELECTIVE ENCRYPTION OF C2DVLC OF AVS VIDEO CODING STANDARD FOR I & P FRAMES. Z. SHAHID, M. CHAUMONT and W. PUECH

FPGA Implementation of High Performance Entropy Encoder for H.264 Video CODEC

An Efficient Mode Selection Algorithm for H.264

Adaptation of Scalable Video Coding to Packet Loss and its Performance Analysis

A NOVEL SCANNING SCHEME FOR DIRECTIONAL SPATIAL PREDICTION OF AVS INTRA CODING

Multimedia Decoder Using the Nios II Processor

Lossless Frame Memory Compression with Low Complexity using PCT and AGR for Efficient High Resolution Video Processing

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

Fast CAVLD of H.264/AVC on bitstream decoding processor

H.264 to MPEG-4 Transcoding Using Block Type Information

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation

ALGORITHM AND ARCHITECTURE DESIGN FOR INTRA PREDICTION IN H.264/AVC HIGH PROFILE

Huffman Coding Author: Latha Pillai

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye

A Single-Issue DSP-based Multi-standard Media Processor for Mobile Platforms

H.264 STREAM REPLACEMENT WATERMARKING WITH CABAC ENCODING

An H.264/AVC Main Profile Video Decoder Accelerator in a Multimedia SOC Platform

MPEG RVC AVC Baseline Encoder Based on a Novel Iterative Methodology

Improving the quality of H.264 video transmission using the Intra-Frame FEC over IEEE e networks

FPGA Implementation of Rate Control for JPEG2000

Transcoding from H.264/AVC to High Efficiency Video Coding (HEVC)

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING

Jun Zhang, Feng Dai, Yongdong Zhang, and Chenggang Yan

Video compression with 1-D directional transforms in H.264/AVC

White paper: Video Coding A Timeline

WITH the growth of the transmission of multimedia content

EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER

An Efficient Hardware Architecture for H.264 Transform and Quantization Algorithms

Context adaptive variable length decoding optimization and implementation on tms320c64 dsp for h.264/avc

An Efficient Intra Prediction Algorithm for H.264/AVC High Profile

IBM Research Report. Inter Mode Selection for H.264/AVC Using Time-Efficient Learning-Theoretic Algorithms

Mark Kogan CTO Video Delivery Technologies Bluebird TV

System-on-Chip Design Methodology for a Statistical Coder

Implementation and analysis of Directional DCT in H.264

A lifting wavelet based lossless and lossy ECG compression processor for wireless sensors

ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS

/$ IEEE

Introduction of Video Codec

Transcription:

A 4-way parallel CAVLC design for H.264/AVC 4 Kx2 K 60 fps encoder Huibo Zhong, Sha Shen, Yibo Fan a), and Xiaoyang Zeng State Key Lab of ASIC and System, Fudan University 825 Zhangheng Road, Shanghai, 201203, China a) fanyibo@fudan.edu.cn Abstract: This paper presents a high performance design for Context- Based Adaptive Variable Length Coding (CAVLC) used in the H.264/ AVC standard. A two-stage encoder is proposed to make the scan and encode stage work simultaneously. The scan engine scans four coefficients at each cycle. Parallel encoder for four levels and parallel encoder for four Run before are adopted to accelerate the encode engine. Only 120 cycles at most are needed to process one MB. The proposed CAVLC encoder can support 4 Kx2 K@60 fps (frame per second) real-time encoding at 250 MHz and the gate count is about 32 k. Keywords: H.264/AVC, entropy coding, CAVLC, level coding Classification: Integrated circuits References [1] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264-ISO/IEC 14496-10 AVC), March 2003. [2] G. Bjontegaardand and K. Lillevold, Context-adaptive VLC (CVLC) coding of coefficients, JVT Document JVT-C028, Fairfax, VA, 2002. [3] T. C. Chen, Y. W. Huang, C. Y. Tsai, B. Y. Hsieh, and L. G. Chen, Architecture design of context-based adaptive variable-length coding for H.264/AVC, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 9, pp. 832 836, Sept. 2006. [4] Y. Yi and B. C. Song, High-speed CAVLC encoder for 1080p 60-Hz H.264 codec, IEEE Signal Process. Lett., vol. 15, pp. 891 894, 2008. [5] S. C. Hsia and W. S. Liao, Fowward Computations for Context-Adaptive Variable-Length Coding Design, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57 no. 8, pp. 637 641, Aug. 2010. 1 Introduction H.264/AVC [1] is the latest international video coding standard. Two kinds of entropy coding methods are specified in H.264/AVC: Context-Based Adaptive Variable Length Coding (CAVLC) and Context-based Binary Arithmetic 1863

Coding (CABAC). Due to the context-adaptive feature, CAVLC yields a higher coding efficiency compared with conventional VLC coding. However, this context-adaptive feature introduces many data dependencies such as: 1). The symbols TotalCoeff, TrailingOne, Sign trail, Level, TotalZero and Run before cannot be obtained until the statistical process of each 4x4 block is finished [1, 2], which makes the scan and encode stage hard to be paralleled. 2). The bit-serial encoding process for all the symbols above makes it difficult to be speeded up by increasing parallelism or pipelining. 3). To encode TotalCoeff, nc should be determined at first, however, the value of nc is calculated according to the upper and left coded 4x4 block. 4). There is seven VLC tables for the levels in each 4x4 block where the table selection depends on the previous table number and the magnitude of the successive encoded Level. Because of the data dependencies in CAVLC, it is hard to propose a CAVLC encoder for high resolution application such as 4 Kx2 K or larger. Typically, there are two methods to improve the speed of CAVLC encoder. One is to increase the operating frequency, and the other is to reduce the cycles of processing one MB. To increase the operating frequency, we can insert more pipeline stage. A two-stage architecture is adopted by many previous works to make the scan and encode work simultaneously. The general approach to cut down the cycles is to increase the parallelism. Varieties of VLSI architectures for CAVLC are proposed in [3, 4, 5]. Chen et al. [3] adopted a two-stage stage structure, but parallelism is not implemented in their work, which makes it require at most 500 cycles to process one MB and greatly restrict the throughput. A parallel structure for Luma (Y) and Chroma (Cr and Cr) encoding is proposed in [4], however, this parallel approach introduces large hardware cost. Yi et al. [5] also employed the twostage structure. In addition, it scanned two coefficients at one cycle by using the parallel strategy. The cycles consumed by the scan and encode engine is not balanced. It needs NC + 4 cycles to process one 4x4 block where NC is equal to the nonzero coefficients in one 4x4 block. Once the NC becomes larger due to the bit rate, the cycles using for encode engine will increase. In this paper, we focus on increasing the hardware parallelism for CAVLC. A two-stage architecture is also proposed in this paper. A paralleled scan engine which can scan four coefficients at one cycle is proposed. In the same way, a paralleled Level encoder is proposed to accelerate the speed of encoder engine. Four Run before symbols are encoded simultaneously with Level. With the methods above, only four cycles are needed by each stage of the CAVLC. Only 120 cycles at most are needed to process one MB. 2 Proposed architecture Fig. 1 shows the proposed two-stage CAVLC encoder. The whole CAVLC encoder is comprised of scan engine and encode engine. In the scan stage, the scan address is generated by the scan engine according to the value of MB type and CBP code. The residual coefficients are fetched from the residc IEICE 2011 1864

ual buffer in the backward order. Then the scan engine counts the required statistics such as TrailingOne, TotalCoeff, TotalZero, and so on. The runlevel symbols are extracted by the run-level detector and stored in the statistic buffer. The symbols obtained by the scan engine are encoded by the engine by looking up corresponding tables. In our work, the two stage of the CAVLC is well designed to consume the same cycles for each 4x4 block. To achieve the highest throughput, a scan engine which scans four coefficients at one cycle is proposed. It takes only four cycles to finish the parameter calculation of one 4x4 block. To achieve the same speed of the scan engine, some schemes have to be adopted to speed up the encode engine. The detail of the two engines is presented in the rest of this section. Fig. 1. Proposed CAVLC architecture 2.1 Scan engine Scan engine is used to figure out the CAVLC parameters. There are totally 384 coefficients (for YUV = 4 : 2 : 0) inside one MB, so scan stage may become a bottleneck. A paralleled scan engine which can scan four coefficients at one cycle is proposed to raise the speed of scan engine. Firstly, eight state bits can be computed as follows: if coeff 0 =0, s0=0 else s0 =1 if coeff 1 =0, s1=0 else s1 =1 if coeff 2 =0, s2=0 else s2 =1 if coeff 3 =0, s3=0 else s3 =1 (1) 1865

if coeff 0 = ±1, c0=1 else c0 =0 if coeff 1 = ±1, c1=1 else c1 =0 if coeff 2 = ±1, c2=1 else c2 =0 if coeff 3 = ±1, c3=1 else c3 =0 As shown in (1) and (2), s0 s3 are the state bits indicating whether the coefficient equals to zero. c0 c3 are the state bits indicating whether the coefficient equals to ±1. By the bits of these eight states, all required CAVLC symbols can be figured out. By judging the value of s0 s3, TotalCoeff, TotalZero, Run before can be obtained. To figure out TrailingOne, all the eight bits should be considered. To lower the design complexity, the four coefficients are divided into two parts: the low and the high. Each part is collected independently. Their results are summarized at the end. The detail of the architecture is shown in Fig. 2 (a). (2) Fig. 2. Detail architecture of proposed CAVLC: (a) Parallel architecture for parameter computing. (b) Proposed parallel Level encoder. (c) Proposed parallel Run before encoder. (d) Timing diagram of the Bitstream Packer 2.2 Encode engine The encode engine encodes the data which are obtained by the scan engine. It is comprised of Level encoder, Run before encoder, TotalCoeff and Trailinc IEICE 2011 1866

gone encoder, TotalZero encoder as shown in Fig. 1. TotalCoeff, TrailingOne and TotalZero are coded by looking up corresponding tables. The Level encoder and Run before encoder are the key of the encode engine. In H.264/AVC standard, seven VLC tables are used to define the codeword of Level symbols. The number of execution cycles required for encoding Level symbols depends on the number of nonzero coefficients in one 4x4 block. There may be at most 16 levels inside one 4x4 block. Therefore, the number of cycles required in coding stage may exceed that in scanning stage. Many previous works encode the Level symbols in sequential because the evaluation of each coefficient level depends on vlc, which denotes the index of the VLC table and is derived from the previously coded coefficient level as pseudo code (3) shows. Let incvlc {0, 3, 6, 12, 24, 48, 65535}. To achieve higher throughput, parallel Level encoding is necessary. The limit of data dependency can be effectively eliminated by using the vlc-look-ahead technique so that coefficients can be processed simultaneously. if (vlc =0) vlc ++; if (abs(level) > incvlc [vlc]) vlc ++; (3) Fig. 2 (b) shows the architecture of proposed Level coding element with VLCN look-ahead. In the proposed design, four Level symbols are fetched from the Level Register Buffer simultaneously, and four Level encoders are used for parallel encoding. Noted that the encode engine will consume the same cycles as the scan engine does if this structure is used. The controller of the Level encoder controls the four Level encoders according to the value of TotalCoeffs and TrailingOnes. VLCN look-ahead computes the four vlc codes for the corresponding level encoder according to the absolute value of the Level and previous vlc. The four Level encoders then encode the Level and generate level prefix and level suffix. And finally, the Level codeword packer packs all the four pairs of level prefix and level suffix into FourLevelCode. In the same way, run before encoder should also be processed parallel as the Level encoder does. Four Run before symbols are fetched from Run before buffer. Four Run before codewords are produced by looking up the Run Before&Zerosleft table. These codewords are packed with the TotalZero Code and stored in a register until all the levels are encoded. The detail of the Run before encoder is shown in Fig. 2 (c). Then stored codeword is called TotalZero/Run before code. 2.3 Bitstream packer The codeword generated by the encoder should be packed into bitstream by the Bitstream Packer. As it has been noted, all the codeword should be packed within four cycles, so the timing of the Bitstream packer should be carefully designed. The detail timing diagram of the bitstream packer is illustrated in Fig. 2 (d). At Enc cycle0 (the first cycle of the encode stage), the TotalCoeff code is packed with the first FourLevelCode. The TotalCoeff 1867

code can be obtained by looking up the VLC tables according to the value of TotalCoeff and TrailingOne. The FourLevelCode is generated as Fig. 2 (b) shows. At the following two cycles (Enc cycle1 and Enc cycle2), the next two FourLevelCode are packed. The TotalZero/Run before code which is generated as Fig. 2 (c) shows is packed with the last FourLevelCode at Enc cycle3 (the last cycle of encode stage). 3 Performance analysis and comparison The proposed architecture has been implemented in Verilog HDL, synthesized using SMIC 0.13 um standard CMOS technology. The circuit occupies about 32 k gates at a core speed of 250 MHz. The comparison on hardware cost and processing speed of the proposed design with the existing design [3, 4, 5] is as Table I shows. From Table I, we can see that the throughput of the proposed design is 120 cycles/mb, which is smaller than all [3, 4, 5] does. In practice, with the method using the coded block pattern (CBP) to determine whether the blocks need to be encoded or not, the average cycles will be smaller than the worst case. It takes about 90 cycles to encode one MB at average according to our experimental results. The hardware cost of our design is 32 k gate, which is larger than [3] and [5]. This is because more hardware parallelism is used in our design, such as four level encoders and four Run before encoders. Table I. Comparison of the proposed design with others In order to perform a fair comparison of various CAVLC design, a metric called Design Efficiency as defined in (4) is introduced. It jointly considers the pixels of Max. Spec (Pixel), video Format (format), gate count (Gate) and operating Frequency (frequency). The format is given by (5). The normalize Design Efficiency is illustrated in Table I. We can see that our design achieves the maximum design Efficiency. Moreover, our work can process 4 Kx2 K@60 fps video in real time. Design Efficiency = Pixel Format Gate Frequency (4) 1868

1.5, YUV =4:2:0 Format = 2, YUV =4:2:2 3, YUV =4:4:4 (5) 4 Conclusion This paper presents a high-throughput CAVLC encoder for the H.264/AVC system. Some parallel schemes are used to reduce the required cycles of processing one MB. With those schemes, a new CAVLC encoder architecture is proposed. Compared with other previous methods, this architecture has the following advantage: 1) it greatly reduces the cycles to process one MB, at most only 120 cycles are needed. 2) It achieves high coding speed with a comparatively less hardware cost. This high-performance CAVLC encoder can process 4 Kx2 K@60 fps video in real time. Acknowledgments This work is supported by State Key Lab of ASIC & System, Fudan University. Project Number: 11MS004. 1869