MPEG RVC AVC Baseline Encoder Based on a Novel Iterative Methodology

Hussein Aman-Allah, Ehab Hanna, Karim Maarouf, Ihab Amer
Laboratory of Microelectronic Systems (GR-LSM), EPFL
CH-1015 Lausanne, Switzerland
{hussein.aman-allah, ehab.hanna, karim.maarouf, ihab.amer}@epfl.ch

Abstract

With the emergence of new generations of multicore architectures, the need for efficient multimedia algorithm implementations has become critical. This paper describes a new methodology for efficient implementations of algorithms targeting reconfigurable architectures. The Reconfigurable Video Coding (RVC) standard aims to provide a framework that allows dynamic development, implementation and adoption of standardized video coding solutions with higher flexibility and reusability. The RVC-CAL actor language is a dataflow language that makes better use of multicore and parallel architectures. The proposed design flow methodology follows an iteration-based implementation model rather than the traditional sequential model. Analysis, design, development, simulation, testing and adaptation are performed in every iteration, and each iteration ends with a functional version of the algorithm. A case study illustrates the productivity of the proposed methodology: an AVC baseline encoder is implemented on a Xilinx Virtex 5 XC5VLX50T FPGA, and search space co-exploration is demonstrated for the intra prediction architecture.

1. Introduction

The increasing complexity of video coding standards has driven the platforms that implement these standards to evolve into multicore and multiple-component parallel architectures. Nevertheless, algorithms are still specified with the same monolithic sequential models, mostly in C/C++, which are unable to exploit the full capabilities offered by advances in the target architectures.
Thus, the MPEG RVC standard, while addressing the reusability and reconfigurability of the different MPEG standards, defines RVC-CAL as the normative language of its video tool library (VTL). As a dataflow language, RVC-CAL can exploit the parallelism offered by the new platforms while also allowing the functional units (FUs) provided in the RVC VTL to be reused and reconfigured. Simulating parallelism is one of the main advantages of dataflow programming over sequential programming, and it makes dataflow programming more suitable for implementations targeting multicore architectures [1]. Unlike a C/C++ implementation, where parallelism has to be defined explicitly, a dataflow description leaves the developer more time to focus on algorithm-specific details and architecture-related optimizations. Nevertheless, starting with an RVC-CAL description and then following the sequential implementation methodology is inconsistent and often leads to unpredictable results.

This paper presents an iterative design flow methodology for the implementation of efficient reconfigurable architectures, as opposed to sequential models that do not suit parallel architectures. It is supported by a case study of an MPEG RVC AVC baseline encoder [2]. The paper is structured as follows. Section 2 presents the proposed methodology. In Section 3, a case study covering the AVC inter prediction, intra prediction and entropy coding modules is examined to demonstrate this methodology. The modeling and synthesis results are presented and analyzed in Section 4. Finally, Section 5 concludes the paper.

2. An iterative methodology

The proposed methodology recommends a design flow, along with tools and practices, that, when applied precisely to a given algorithm, result in an optimized implementation for the target architecture. Given an algorithm to be implemented, the appropriate FUs are first selected from the VTL.
An RVC-CAL description is then written to assemble the FUs and to reconfigure them by adding functionality that is not yet provided by the VTL. Test cases are fed to the RVC-CAL description to verify it against the standardized reference software. Once the RVC-CAL description is validated, the CAL2HDL tool performs resource scheduling and optimizations that result in an efficient HDL description [3]. A test bench is provided to verify that the functionality of the generated HDL matches that of the RVC-CAL description from the previous step. In case of any anomalies, the RVC-CAL description is revised to make sure that its definitions are unambiguous for the CAL2HDL tool. Based on observations of the previously generated HDL code, optimizations may also be applied to the RVC-CAL description so that the generated HDL code makes better use of the available hardware resources.

The next step is to synthesize the generated HDL code into a hardware implementation targeting a specific architecture. The output is simulated and examined. If, after mapping and routing by the synthesis tool, the implementation does not meet the target architectural specifications in terms of resource consumption, throughput, frequency or other parameters, the RVC-CAL description is revised and optimized. When the synthesized implementation meets the design goals of its target architecture, the process terminates and the implementation is ready.

The main advantage of the proposed design flow and the associated tools is that several iterations of the steps described in Figure 1 can be performed in considerably less time than with the traditional sequential models, for several reasons.

Figure 1 - The proposed design flow

First, bottlenecks and anomalies are detected early in the process because of RVC-CAL's ability to simulate parallel and multicore architectures. Second, the implementation passes through several optimizations on different levels. Third, the modifications are always performed at the RVC-CAL level.
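The design flow above can be read as a loop around the RVC-CAL description. The following Python sketch is only an illustration of that loop, not the authors' tooling: `iterate_design`, `SynthesisReport` and every callable passed in are hypothetical names, and the toy stand-ins at the bottom use made-up numbers rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class SynthesisReport:
    """Hypothetical summary of one synthesis run."""
    fits_device: bool
    frames_per_second: float


def iterate_design(description, validate, generate_hdl, hdl_matches,
                   synthesize, revise, target_fps, max_iterations=10):
    """Revise the high-level description until the design goals are met.

    Returns (description, iteration_index) on success, or None if the
    iteration budget is exhausted. All callables are assumptions standing
    in for the reference-software check, CAL2HDL, the HDL test bench and
    the synthesis tool.
    """
    for iteration in range(max_iterations):
        # Step 1: check the description against the reference test cases.
        if not validate(description):
            description = revise(description, "fails reference test cases")
            continue
        # Step 2: generate HDL and check it against the description.
        hdl = generate_hdl(description)
        if not hdl_matches(description, hdl):
            description = revise(description, "HDL differs from description")
            continue
        # Step 3: synthesize and compare with the design goals.
        report = synthesize(hdl)
        if report.fits_device and report.frames_per_second >= target_fps:
            return description, iteration
        # Every revision happens on the high-level description only.
        description = revise(description, "misses design goals")
    return None


def toy_synthesize(hdl):
    # Toy model: each revision adds 8 frames/s. Illustrative numbers only.
    return SynthesisReport(fits_device=True,
                           frames_per_second=10.0 + 8.0 * hdl)


# Toy run: the "description" is just a revision counter.
result = iterate_design(
    description=0,
    validate=lambda d: True,
    generate_hdl=lambda d: d,
    hdl_matches=lambda d, h: True,
    synthesize=toy_synthesize,
    revise=lambda d, reason: d + 1,
    target_fps=30.0,
)
print(result)
```

In this sketch, `revise` only ever receives the high-level description, which mirrors the point that optimizations are applied at the RVC-CAL level and never to the generated HDL.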
Because modifications stay at the RVC-CAL level, each iteration needs less time, and frequent architectural adaptations are feasible thanks to the strong abstraction and encapsulation properties exhibited by CAL.

The proposed methodology allows the resources/throughput search space to be explored. A set of transitions is explored until the implementation meets the algorithm/architecture design goals [4]. The initial RVC-CAL description is usually a direct implementation of the algorithm that aims at creating a starting point for simulation. Its synthesis result may not meet the target specifications, since no optimization iterations have been performed yet. Several optimization cycles are then performed, each moving the implementation closer to the target specifications.

3. RVC AVC baseline encoder case study

The proposed methodology, along with its tools and practices, is applied to a case study of an MPEG RVC AVC baseline profile encoder. The aim of the case study is to demonstrate how the proposed approach contributes to the efficiency of the implementation.

3.1. Inter prediction

The AVC inter prediction module estimates the motion occurring between frames. It aims at removing the temporal redundancy that exists between frames in order to achieve high compression of the video sequence [5]. Figure 2 shows the inter prediction RVC-CAL network. The two main steps in inter prediction are motion estimation and motion compensation. Motion estimation calculates the motion between frames via full-search macroblock matching and returns a motion vector describing the displacement between the current block and the best matching reference block. Motion compensation uses the current block and the best matching reference block to calculate the residual error. The residual error is then passed to the transform coding module to be compressed and reconstructed, and is finally sent back to the motion compensation module to reconstruct the current block. The motion compensation module then stores the reconstructed blocks in the memory to be used as a reference for the next frame to be encoded.

Figure 2 - Inter prediction RVC-CAL network

As shown in Figure 2, the video reader reads individual frames from a video sequence and either passes them directly to the memory as P-frames or sends them to the intra prediction module, which encodes and reconstructs the I-frame before storing it in the memory. Finally, inter prediction calculates the differential motion vectors based on the difference between the actual and predicted motion vectors [6]. The modules are moderated by the inter prediction control unit (IPCU).

3.2. Intra prediction

In intra prediction, predictors are formed from previously encoded and reconstructed macroblocks (MBs). In AVC luma prediction, predictors can be produced either for every 4x4 block in an MB (9 modes) or for the MB as a whole (4 modes). The encoder selects the mode that minimizes the difference between the predictor and the block to be encoded [6]. Figure 3 [7] shows the RVC-CAL network for an intra prediction module.

Figure 3 - Intra prediction RVC-CAL network

The design is based on [7], where reconfigurable processing elements (PEs) are used to generate predictors for all the modes instead of one PE for each mode. There are four PEs, which generate four predictors in parallel. This four-parallel architecture achieves a good tradeoff between throughput and resource usage. Each PE consists of a series of adders, registers and shifters that calculate the value of the predictor. The inputs to the PEs are determined by the main actor in the network, the PE Controller. It selects the suitable inputs depending on the prediction mode and the position of the sample to be predicted. A more detailed explanation of this module can be found in [8].

3.3. Entropy coding

The AVC baseline profile uses Exp-Golomb coding to encode all syntax elements except for the residual data, which is encoded using Context Adaptive Variable Length Coding (CAVLC). Exp-Golomb codes are variable-length binary codes that are constructed systematically, as thoroughly explained in [9]. Figure 4 [2] shows the RVC-CAL network implementing the Exp-Golomb coder.
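The systematic construction of Exp-Golomb codewords can be sketched in a few lines of Python. This is a minimal illustration of the unsigned and signed mappings as defined for H.264/AVC, not the RVC-CAL network of the paper; the function names `ue_codeword`, `se_to_code_num` and `se_codeword` are hypothetical, and the other mapping modes handled by the network are left out.

```python
def ue_codeword(code_num: int) -> str:
    """Unsigned Exp-Golomb codeword for a non-negative code number.

    The codeword is the binary form of (code_num + 1), preceded by as
    many zero bits as there are bits after its leading one.
    """
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    info = bin(code_num + 1)[2:]          # binary digits of code_num + 1
    return "0" * (len(info) - 1) + info   # zero prefix, then the digits


def se_to_code_num(value: int) -> int:
    """Signed mapping: positive v -> 2v - 1, zero or negative v -> -2v."""
    return 2 * value - 1 if value > 0 else -2 * value


def se_codeword(value: int) -> str:
    """Signed Exp-Golomb codeword, built on top of the unsigned code."""
    return ue_codeword(se_to_code_num(value))


# Print the first few unsigned and signed codewords.
for n in range(5):
    print(n, ue_codeword(n))
for v in (0, 1, -1, 2, -2):
    print(v, se_codeword(v))
```

The signed variant only remaps the value before reusing the unsigned construction, which is the kind of separation between mapping and codeword generation that makes one actor per mapping mode a natural decomposition.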

Figure 4 - Exp-Golomb RVC-CAL network

The Exp-Golomb module shown in Figure 4 is implemented in RVC-CAL as a simple network with one input port, which receives the parameter token to be encoded, and one output port, which outputs the codeword serially. The Exp-Golomb network connects nine different actors, four of which perform the tasks of the four different mapping modes. A controller moderates the mapping actors and performs the mapping mode decision. The different mappings can be executed in parallel, since a different actor is responsible for each. Thus, with pipelining at the controller and parallel execution at the mapping actors, the throughput improves significantly.

Similarly, the RVC-CAL CAVLC model shown in Figure 5 [2] is implemented with slightly more actors. CAVLC switches between different VLC look-up tables (LUTs) to encode the different syntax elements, based on the values of previously coded elements. Each LUT exploits the statistical properties of the syntax elements it is designed to encode. The LUTs can sometimes be accessed in parallel, and the different actors execute and prepare their output independently of the sequential flow of the algorithm. The Assembler actor is then responsible for compiling the output tokens from the different actors and producing the encoded stream serially.

Figure 5 - CAVLC RVC-CAL network

4. Results and analysis

The results of applying the proposed methodology to the RVC encoder can be presented on two levels: the modeling level, which covers the development effort consumed and the productivity gained, and the implementation level, in which the RVC-CAL implementation is synthesized and implemented on an FPGA.

4.1. Modeling results

One of the major advantages of RVC is that it accompanies its normative description language (RVC-CAL) with many supporting tools that enable automatic code generation into software (CAL2C) and hardware (CAL2HDL) [8]. The results section in [2] shows the lines of code, development time and number of developers consumed in the RVC-CAL, C and VHDL implementations, respectively. Although the number of iterations performed on the RVC-CAL implementation exceeds that of the other two implementations by a factor of four or five, the development effort consumed is much lower. This saving, reflected in Figure 6, contributes directly to production costs and time to market (TTM).

Figure 6 - Gain from using the RVC-CAL implementation

Figure 6 shows a comparison normalized with respect to the VHDL implementation. The RVC-CAL implementation is twice as productive as the C/C++ sequential model and four times as productive as the VHDL behavioral implementation. Additional enhancements to RVC-CAL will embed built-in functions that could even double the RVC-CAL productivity.

4.2. Search space co-exploration

Figure 7 - Intra prediction search space co-exploration

Following the proposed methodology, the initial design of the intra prediction module was validated against the AVC reference software. However, upon synthesis, the resulting FPGA implementation did not achieve the throughput required for real-time SDTV (720x480). Therefore, optimization iterations were needed. Figure 7 and Table 1 show the resource/throughput states at the end of the optimization iterations.

Optimization proceeds by searching for the critical path and splitting it. This can be done by dividing actors or actions. In the first iteration, the PE Controller was split into two actors and their outputs were multiplexed to the PEs. However, this multiplexing lengthened the critical path, which made the throughput worse. Therefore, in the second iteration, the PEs were duplicated instead of multiplexing the outputs. In the third and fourth iterations, the PE Controller was split into three and four actors, respectively. Splitting into four actors resulted in an implementation that exceeded the FPGA resources, whereas three PE Controllers gave the best tradeoff between throughput and resources. Therefore, the fifth iteration targeted the actions in each of the three PE Controllers.

There is a tradeoff between the number of actions and the size of the state machine. A large number of small actions complicates the state machine, which increases its control logic. On the other hand, a small number of actions simplifies the state machine, but the control logic inside the actions themselves becomes complicated. The right balance between action size and state machine size must therefore be found for each actor. In the fifth iteration, the simple actions were grouped and the complicated actions were divided to achieve this balance. This led to meeting both the throughput and the resource requirements.

4.3. Synthesis results

The gain achieved on the modeling level in terms of lines of code is echoed on the hardware implementation level. The HDL model is generated from the presented RVC-CAL model using the CAL2HDL tool. The HDL code is synthesized using Xilinx ISE targeting a Xilinx Virtex 5 FPGA. Table 1 shows an example of the proposed optimization iterations applied to the intra prediction module, with the synthesis results at the end of each iteration. The methodology performs several optimization iterations until the target design goals are achieved.

Table 1 - Intra prediction exploratory iterations

Iteration  CLK frequency (MHz)  # of registers  # of LUTs  Throughput (frames/s)
0          49.615               5533            7761       10
1          46.755               6892            9021       9
2          85.69                4554            5079       21
3          110.011              5755            4586       27
4          Exceeded available resources
5          120.221              5477            6293       30

5. Conclusion

In this paper, a methodology for the efficient implementation of algorithms targeting reconfigurable architectures was presented. The methodology defines an iterative design flow and a set of tools and practices for architecture exploration and optimization. Using RVC-CAL as the modeling language for the proposed methodology exploits the parallelism and reconfigurability supported by the newer generations of multimedia platforms. In addition, the strong abstraction and encapsulation properties exhibited by RVC-CAL allow for faster optimization iterations. An MPEG RVC AVC baseline encoder case study was provided to demonstrate the gain achieved from using the proposed methodology on both the modeling and implementation levels. The results show that the proposed approach minimizes the time and effort consumed during development. The results are also reflected on the hardware implementation level, taking intra prediction as an example.

6. References

[1] Bhattacharyya, S., Brebner, G., Janneck, J., Eker, J., von Platen, C., Mattavelli, M., Raulet, M. OpenDF - A Dataflow Toolset for Reconfigurable Hardware and Multicore Systems. First Swedish Workshop on Multi-Core Computing, 2008.

[2] Aman-Allah, H., Hanna, E., Maarouf, K., Amer, I. Towards a Comprehensive RVC VTL: A CAL Description of an Efficient AVC Baseline Encoder. IEEE International Conference on Image Processing (ICIP 2009), 2009.
[3] Janneck, J., Miller, I. D., Parlour, D. B., Roquier, G., Wipliez, M., Raulet, M. Synthesizing hardware from dataflow programs: An MPEG-4 simple profile decoder case study. IEEE Workshop on Signal Processing Systems (SiPS 2008), 2008.
[4] Lucarz, C., Faure, P., Roquier, G., Mattavelli, M., Noel, V., Amer, I., Alisafee, M. ACTORS European Project, technical report 2009-01, Micro-Electronic Systems Laboratory, Ecole Polytechnique Federale de Lausanne (EPFL), 2009.
[5] Yang, W. An Efficient Motion Estimation Method for MPEG-4 Video Encoder. IEEE Transactions on Consumer Electronics, 49(2), 2003.
[6] Richardson, I. E. G. H.264 and MPEG-4 Video Compression. Aberdeen. Wiley, 2003.
[7] Huang, Y., Hsieh, B., Chen, T.-C., Chen, L. Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder. IEEE Transactions on Circuits and Systems for Video Technology, 15(3), 2005, 378-401.
[8] Lucarz, C., Mattavelli, M., Wipliez, M., Roquier, G., Raulet, M., Janneck, J., et al. Dataflow/Actor-Oriented language for the design of complex signal processing systems. Conference on Design and Architectures for Signal and Image Processing (DASIP), 2008.
[9] Silva, T., Vortmann, J., Agostini, L., Bampi, S., Susin, A. FPGA Based Design of CAVLC and Exp-Golomb Coders for H.264/AVC Baseline Entropy Coding. 3rd Southern Conference on Programmable Logic (SPL07), 2007.