A Dedicated Hardware Solution for the HEVC Interpolation Unit

Similar documents
A Hardware Solution for the HEVC Fractional Motion Estimation Interpolation

Research Article A High-Throughput Hardware Architecture for the H.264/AVC Half-Pixel Motion Estimation Targeting High-Definition Videos

An HEVC Fractional Interpolation Hardware Using Memory Based Constant Multiplication

RFCAVLC8t: a Reference Frame Compression Algorithm for Video Coding Systems

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased

An Efficient Hardware Architecture for H.264 Transform and Quantization Algorithms

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING

EE 5359 Low Complexity H.264 encoder for mobile applications. Thejaswini Purushotham Student I.D.: Date: February 18,2010

New Motion Estimation Algorithms and its VLSI Architectures for Real Time High Definition Video Coding

EE Low Complexity H.264 encoder for mobile applications

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000

White paper: Video Coding A Timeline

Reducing/eliminating visual artifacts in HEVC by the deblocking filter.

An Efficient Mode Selection Algorithm for H.264

Transcoding from H.264/AVC to High Efficiency Video Coding (HEVC)

High Efficiency Video Coding (HEVC) test model HM vs. HM- 16.6: objective and subjective performance analysis

Analysis of Information Hiding Techniques in HEVC.

Analysis of Motion Estimation Algorithm in HEVC

HIGH-PERFORMANCE RECONFIGURABLE FIR FILTER USING PIPELINE TECHNIQUE

Digital Video Processing

Bi-directional optical flow for future video codec

Performance Comparison between DWT-based and DCT-based Encoders

N RISCE 2K18 ISSN International Journal of Advance Research and Innovation

MultiFrame Fast Search Motion Estimation and VLSI Architecture

HEVC The Next Generation Video Coding. 1 ELEG5502 Video Coding Technology

High-Performance VLSI Architecture of H.264/AVC CAVLD by Parallel Run_before Estimation Algorithm *

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye

Fast frame memory access method for H.264/AVC

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR.

Homogeneous Transcoding of HEVC for bit rate reduction

An Efficient Table Prediction Scheme for CAVLC

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding

Federal University of Pelotas UFPel Group of Architectures and Integrated Circuits Pelotas Brasil

STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC)

Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda

VHDL Implementation of H.264 Video Coding Standard

High-Throughput Parallel Architecture for H.265/HEVC Deblocking Filter *

Multimedia Decoder Using the Nios II Processor

Multi-level Design Methodology using SystemC and VHDL for JPEG Encoder

A COMPARISON OF CABAC THROUGHPUT FOR HEVC/H.265 VS. AVC/H.264. Massachusetts Institute of Technology Texas Instruments

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

A full-pipelined 2-D IDCT/ IDST VLSI architecture with adaptive block-size for HEVC standard

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

IN RECENT years, multimedia application has become more

FPGA Based Low Area Motion Estimation with BISCD Architecture

Advanced Video Coding: The new H.264 video compression standard

OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD

Pipelined Fast 2-D DCT Architecture for JPEG Image Compression

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc.

Selected coding methods in H.265/HEVC

Design of 2-D DWT VLSI Architecture for Image Processing

Implementation of Pipelined Architecture Based on the DCT and Quantization For JPEG Image Compression

FPGA Implementation of High Performance Entropy Encoder for H.264 Video CODEC

MPEG RVC AVC Baseline Encoder Based on a Novel Iterative Methodology

Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter

Performance Analysis of DIRAC PRO with H.264 Intra frame coding

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000

Intra Prediction Efficiency and Performance Comparison of HEVC and VP9

POWER CONSUMPTION AND MEMORY AWARE VLSI ARCHITECTURE FOR MOTION ESTIMATION

MediaTek High Efficiency Video Coding

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation

Implementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications

arxiv: v1 [cs.mm] 9 Aug 2017

Next-Generation 3D Formats with Depth Map Support

[30] Dong J., Lou j. and Yu L. (2003), Improved entropy coding method, Doc. AVS Working Group (M1214), Beijing, Chaina. CHAPTER 4

Improved Context-Based Adaptive Binary Arithmetic Coding in MPEG-4 AVC/H.264 Video Codec

FPGA Implementation of Low Complexity Video Encoder using Optimized 3D-DCT

Intra Prediction Efficiency and Performance Comparison of HEVC and VP9

Low power context adaptive variable length encoder in H.264

EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM

SAD implementation and optimization for H.264/AVC encoder on TMS320C64 DSP

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression

COMPARISON OF HIGH EFFICIENCY VIDEO CODING (HEVC) PERFORMANCE WITH H.264 ADVANCED VIDEO CODING (AVC)

NEW CAVLC ENCODING ALGORITHM FOR LOSSLESS INTRA CODING IN H.264/AVC. Jin Heo, Seung-Hwan Kim, and Yo-Sung Ho

EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER

Design of a Multiplier Architecture Based on LUT and VHBCSE Algorithm For FIR Filter

Image/video compression: howto? Aline ROUMY INRIA Rennes

Objective: Introduction: To: Dr. K. R. Rao. From: Kaustubh V. Dhonsale (UTA id: ) Date: 04/24/2012

IMPLEMENTATION OF DEBLOCKING FILTER ALGORITHM USING RECONFIGURABLE ARCHITECTURE

Exploring the Design Space of HEVC Inverse Transforms with Dataflow Programming

LIST OF TABLES. Table 5.1 Specification of mapping of idx to cij for zig-zag scan 46. Table 5.2 Macroblock types 46

2014 Summer School on MPEG/VCEG Video. Video Coding Concept

High Efficiency Video Decoding on Multicore Processor

IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner

DESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL. Shruti Hathwalia* 1, Meenakshi Yadav 2

Bit Allocation for Spatial Scalability in H.264/SVC

Xin-Fu Wang et al.: Performance Comparison of AVS and H.264/AVC 311 prediction mode and four directional prediction modes are shown in Fig.1. Intra ch

H.264/AVC Baseline Profile to MPEG-4 Visual Simple Profile Transcoding to Reduce the Spatial Resolution

A Single-Issue DSP-based Multi-standard Media Processor for Mobile Platforms

Reduced Frame Quantization in Video Coding

Video Codecs. National Chiao Tung University Chun-Jen Tsai 1/5/2015

RECOMMENDATION ITU-R BT

FPGA Implementation of 4-Point and 8-Point Fast Hadamard Transform

OPTIMIZED FORWARD/INVERSE QUANTIZATION UNIT FOR H.264/AVC CODECS

A comparison of CABAC throughput for HEVC/H.265 VS. AVC/H.264

Overview, implementation and comparison of Audio Video Standard (AVS) China and H.264/MPEG -4 part 10 or Advanced Video Coding Standard

A COST SHARED QUANTIZATION ALGORITHM AND ITS IMPLEMENTATION FOR MULTI - STANDARD VIDEO CODECS. A Thesis Submitted to the College of

Transcription:

XXVII SIM - South Symposium on Microelectronics 1 A Dedicated Hardware Solution for the HEVC Interpolation Unit 1 Vladimir Afonso, 1 Marcel Moscarelli Corrêa, 1 Luciano Volcan Agostini, 2 Denis Teixeira Franco {vafonso, mmcorrea, agostini}@inf.ufpel.edu.br, denisfranco@furg.br 1 UFPel Federal University of Pelotas 2 FURG Federal University of Rio Grande Abstract The increasing demand for high-resolution digital video for many different types of applications stimulated the efforts to improve the performance in video coding. This work presents a hardware architecture developed for the Interpolation Unit necessary for the Fractional Motion Estimation step defined in the emerging High Efficiency Video Coding (HEVC). The architectural solution was fully described in VHDL and obtained good results of processing rate and cost, achieving a maximum frequency of 187 MHz and processing up to 28 QFHD (3840x2160) frames per second. 1. Introduction Video coding is currently an important research area in function of the increase demand for highdefinition digital videos. There are different video coding standards, and these standards primarily define two things: (1) a coded representation (or syntax), which describes the visual data in a compressed form, and (2) a method to decode the syntax to reconstruct the visual information [1]. Nowadays, the most efficient video coding standard available is the H.264/AVC (Advanced Video Coding) [2]. However, video resolutions and the amount of information that must be processed are continuously increasing. Because of this demand, a new video coding standard, designed to be more efficient than the H.264/AVC, is being developed in a joint effort of the ITU-T and ISO/IEC, through the JCT-VC (Joint Collaborative Team on Video Coding) group. This new standard is currently known as HEVC (High Efficiency Video Coding) [3], and represents the state-of-the-art of video coding tools. There are some important factors that should be considered in video coding. One of these factors is the computational complexity, since the real-time processing requires a very high calculation power to compress high-definition videos even modern general-purpose processors are not able to reach this restriction. In addition, software solutions running in an embedded system are not appropriate for applications like video encoders, because it would be necessary a very high-performance processor, prohibitively increasing the energy consumption. This is a general problem, but it is critical for mobile devices. Therefore, it is necessary the development of high-efficiency ASICs (Application Specific Integrated Circuits) with expressive processing rates and low energy consumption to support the HEVC standard. Video encoders have different modules to explore each redundancy type present in visual information. The interframe prediction block uses the Motion Estimation (ME) process and it is responsible for the best results in terms of compression rates in the encoder, but it is also the most complex process [4]. It explores and reduces the temporal redundancy, which is the similarity among sequential frames and it works splitting the current frame into several blocks and then searching in the previous coded frames (reference frames) for the block that is most similar to the current one. After this search, a motion vector (MV) is generated to indicate the offset of the selected block inside the reference frame. The goal of ME is to find the best MV whilst keeping the computational complexity inside acceptable limits [1]. An important technique that can be used in ME is the Fractional Motion Estimation (FME). The FME allows an even greater efficiency by applying an interpolation process between integer position samples in reference frames, allowing a search for better matches in fractional positions. The FME is composed by two units: the Interpolation Unit, that generates the fractional position samples (sub-pixels), and the Search Unit, which searches for better matches composed by sub-pixels. The HEVC emerging standard proposes new algorithms for the FME process, making it more efficient than the H.264/AVC FME. Therefore, the design of high-performance hardware architectures for the HEVC interpolation process is an important research effort.

2 XXVII SIM - South Symposium on Microelectronics 2. State-of-the-Art and Related Works In the HEVC proposal, some modifications were included in the interpolation filters used in the Interpolation Unit to improve the results obtained with the FME. While the H.264/AVC standard uses two dependent steps to generate the luminance samples in quarter-pixel positions, the HEVC produces the samples in quarter-pixel positions in a single step. In H.264/AVC, the first step is responsible to obtain the samples in half-pixel positions from luminance samples in integer positions using a 6-tap FIR filter (Finite Impulse Response). After the interpolation, a search process in half-pixel positions to find a better match must take place. In the second step, the luminance samples in quarter-pixel positions are obtained through a bilinear filter using the data from the last search. In HEVC, the process used to generate luminance samples for quarter-pixel positions applies an 8-tap DCT filter (Discrete Cosine Transform) and it is capable to calculate both half-pixels and quarter-pixels in a single step. It allows the FME to use a single search process for all sub-pixels [5]. As previously cited, the most efficient video coding standard currently available is the H.264/AVC. Regarding this standard, many works can be found in the literature including works related with the development of hardware architectures for FME, such as [6 10]. We did not found works in the literature directly related to the HEVC standard, and then was not possible to make comparisons between our results and related works. 3. Proposed Interpolation Unit Architecture The hardware architecture designed in this work was based in two documents elaborated by JCT-VC [5][11]. In those documents, it is possible to find all processes used to generate the sub-pixel positions, according to the HEVC specification. Fig. 1 represents the integer samples (shaded blocks) and fractional sample positions (white blocks) for quarter-pixel interpolation in HEVC. This figure and the equations (1-6) which are used to obtain the fractional positions are important to understand the interpolation process. Fig. 1 Integer samples and fractional samples used in the HEVC quarter-pixel interpolation. Based on the algorithms given in [11] it was possible to find similarities in the equations of some fractional positions. Equation (1) is used to calculate horizontally aligned sub-pixels between integer positions (a 0,0, b 0,0 and c 0,0 ), and equation (2) for vertically aligned sub-pixels between integer positions (d 0,0, h 0,0 and n 0,0 ). The variables C 2 -C 7 are coefficients that change according to sub-pixel positions, and their values are showed in tab. 1.

XXVII SIM - South Symposium on Microelectronics 3 To calculate sub-pixels that are not horizontally or vertically aligned between integer positions, it is necessary to calculate the intermediate values d1 i,0, h1 i,0 and n1 i,0 before, using equation (3). Afterwards, the final prediction is realized. For the sub-pixels e 0,0, f 0,0 and g 0,0 equation (4) is used. The sub-pixels i 0,0, j 0,0 and k 0,0 come from equation (5) and the sub-pixels p 0,0, q 0,0 and r 0,0 are produced by equation (6). ( A 3,0 C2 A 2,0 C3 A 1,0 C4A0,0 C5A1,0 C6 A2,0 C7A3,0 A4, 0 32) 6 (1) ( A 0, 3 C 2 A 0, 2 C 3 A 0, 1 C 4 A 0,0 C 5 A 0,1 C 6 A 0,2 C 7 A 0,3 A 0,4 32) 6 (2) A i, 3 C 2 A i, 2 C 3 A i, 1 C 4 A i,0 C 5 A i,1 C 6 A i,2 C 7 A i,3 A i,4 (3) ( 0 d 1 3,0 C2d1 2,0 C3d1 1,0 C4d10,0 C5d11,0 C6d12,0 C7d13,0 d14, 2048) 12 (4) ( h 1 3,0 C2h1 2,0 C3h1 1,0 C4h1 0,0 C5h11,0 C6h1 2,0 C7h1 3,0 h14, 0 2048) 12 (5) ( n 1 3,0 C2n1 2,0 C3n1 1,0 C4n1 0,0 C5n11,0 C6n1 2,0 C7n1 3,0 n14, 0 2048) 12 (6) Tab. 1 Coefficients of filters to generate sub-pixel positions Type Positions Coefficients (C 2 C 7 ) L(left) a i,j, d i,j, e i,j, i i,j, p i,j {4, 10, 57, 19, 7, 3} M (middle) b i,j, h i,j, f i,j, j i,j, q i,j {4, 11, 40, 40, 11, 4} R (right) c i,j, n i,j, g i,j, k i,j, r i,j {3, 7, 19, 57, 10, 4} In the specification of the designed architecture, blocks of 8x8 size were considered and three different filters were developed, one type for each coefficients group, according to tab. 1. Each one of the filters has three output values: one to implement equations (1-2); one to implement equation (3) and one to implement equations (4-6). The multiplications presented in equations (1-6) were converted in shifts-add to decrease the hardware use and the power consumption and to increase the architecture processing rate. In addition to the filters used in the operative part, register banks were described to implement the barrel shifting operation needed to implement the multiplications. The filters receive an entire line or column coming from these register banks and they can calculate an entire line or column of sub-pixel positions in one clock cycle. Fig. 2 shows the simplified block diagram of the designed architecture. 4. Synthesis Results Fig. 2 Simplified block diagram of the designed architecture. The proposed design was fully described in VHDL and synthesized to a Xilinx Virtex4 XC4VLX15 FPGA device, using the Xilinx ISE 10.1 synthesis tool.

4 XXVII SIM - South Symposium on Microelectronics As shown in fig. 2, this design needs a total of 27 interpolation filters (9 of each type) in order to simplify the control unit and achieve a high efficiency. Also, register banks are used to store and shift samples. When synthesized to the XC4VLX15 device, the architecture achieved a maximum frequency of 187 MHz. Since our design needs 52 clock cycles to process an entire block, this frequency allows it to process 111 Full HD (1920x1080) frames per second or 28 QFHD (3840x2160) frames per second. The operative module of the architecture, composed by the filters, consumed 4615 slices (75% of the target device resources), 3309 slice flip-flops (26%) and 8333 4-input LUTs (67%). The buffers were mapped as registers banks and the amount of hardware elements used by them is shown in tab. 2. Tab. 2 Hardware elements used by the buffers. Buffer 8-bit REG 10-bit REG 11-bit REG 16-bit REG + 4:1 MUX + 2:1 MUX + 2:1 MUX + 2:1 MUX Integer Samples 256 - - - V 1 Type - - - 432 V Type - 216 - - H Type - 216 - - D Type - - 729 - The use of hardware resources may seem high, but it is important to note that this is not an implementation optimized for FPGA devices, and therefore DSP blocks and dedicated memory were not used, since this is a multiplier free implementation. 5. Conclusions and Future Works In this paper, a high performance hardware architecture for the HEVC sub-pixel interpolation unit was presented. When synthesized to a Xilinx Virtex4 FPGA, our architecture achieves a very high processing rate. It can process very high definition videos like QFHD in real time, and it can also process Full HD videos when running at lower frequencies, which is very important when power and energy consumption restrictions are being considered. As future works, it is planned to design a fully functional Fractional Motion Estimation architecture that uses a fast search algorithm for the HEVC video coding standard. 6. References [1] I. G. Richardson, H.264 and MPEG-4 Video Compression, Wiley, 2003. [2] Joint Video Team (JVT) of ITU-T VCEG and ISO/IEC MPEG, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, May 2003. [3] G. Sullivan, and T. Wiegand, Draft requirements for next-generation video coding project, International Telecommunication Union, 2009. [4] A. Puri, et al., Video Coding Using the H.264/MPEG-4 AVC Compression Standard. Elsevier Signal Processing: Image Communication, 2004. [5] K. McCann, B. Bross, S. Sekiguchi, and W. Han, High Efficiency Video Coding (HEVC) Test Model 3 Encoder Description, 2011. [6] M. M. Corrêa, and M. T. Schoenknecht, A H.264/AVC quarter-pixel motion estimation refinement architecture targeting high resolution videos, VII Southern Conference on Programmable Logic, 2011. [7] M. M. Corrêa, M. T. Schoenknecht, and R. S. Dornelles, A high-throughput hardware architecture for the H.264/AVC half-pixel motion estimation targeting high-definition videos, International Journal of Reconfigurable Computing, 2011. [8] S. Yalcin, and I. Hamzaoglu, A high performance hardware architecture for half-pixel accurate H.264 motion estimation, 14th International Conference on Very Large Scale Integration and System-on- Chip, October 2006. [9] S. Oktem, and I. Hamzaoglu, An efficient hardware architecture for quarter-pixel accurate H.264 motion estimation, 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools, August 2007.

XXVII SIM - South Symposium on Microelectronics 5 [10] C. Kao, H. Kuo, and Y. Lin, High performance fractional motion estimation and mode decision for the H.264/AVC, IEEE International Conference on Multimedia and Expo, July 2006. [11] HM3. HEVC Test Model Reference Software. Jan, 2012; http://hevc.info/hm-doc/