A Very High Throughput Deblocking Filter for H.264/AVC


DOI.0/s-0-0-

M. Kthiri & B. Le Gal & P. Kadionik & A. Ben Atitallah

Received: October 0 / Revised: December 0 / Accepted: March 0
© Springer Science+Business Media New York 0

Abstract This paper presents a novel hardware architecture for the real-time, high-throughput implementation of the adaptive deblocking filtering process specified by the H.264/AVC video coding standard. A parallel filtering order based on six filtering units is proposed. With this parallel filtering order, which is fully compliant with H.264/AVC, and a dedicated data arrangement in local memory banks, the proposed architecture filters one macroblock in fewer cycles than previously proposed approaches. Filtering efficiency is further improved by a novel computation schedule and a dedicated architecture composed of six filtering cores. The design can be used in either the decoder or the encoder, as a hardware accelerator for a processor, or embedded into a full-hardware codec. The Intellectual Property block built on the proposed architecture supports multiple high-definition processing flows in real time. Working at a clock frequency of 0 MHz and synthesized in a nm low-power, low-voltage CMOS standard-cell technology, it easily meets the throughput requirements for k video at 0 fps at all levels of the H.264/AVC video coding standard and consumes .0 Kgates.

M. Kthiri (*) : B. Le Gal : P. Kadionik
IMS laboratory - ENSEIRB-MATMECA, University Bordeaux, CNRS UMR, Cours de la Libération, 0 Talence Cedex, France
kthiri@enseirb.fr
B. Le Gal: bertrand.legal@ims-bordeaux.fr
P. Kadionik: kadionik@enseirb-matmeca.fr

A. B. Atitallah
High Institute of Electronics and Communication, University of Sfax, 0 Sfax, Tunisia
ahmed.benatitallah@isecs.rnu.tn

Keywords Deblocking filter. Filtering order. ASIC.
H.264/AVC video coding

Introduction

In the beginning of 00, the H.264/AVC algorithm was presented as a promising solution for the multimedia market because of its higher compression efficiency compared with other video encoding algorithms such as MPEG-, H. and MPEG- []. Comparative studies reveal that, for the same video quality, the stream generated by the H.264/AVC algorithm occupies approximately half of the bandwidth required by the MPEG- algorithm []. To increase overall video encoding efficiency, the H.264/AVC standard improves some traditional MPEG internal modules, for example the DCT (using an integer version) and inter-frame motion estimation (supporting quarter-pixel resolution, multiple reference frames and variable block sizes). Moreover, several additional features have been incorporated in the H.264/AVC standard, including intra-frame prediction, CABAC and a deblocking filter [].

An important advantage of H.264/AVC is the inclusion of this anti-blocking filter, also called the deblocking filter. Applied to the final images, it improves video quality by attenuating the blocking artifacts that are normally found in decoded images. As a result, the subjective quality is significantly improved, which allows the video quality to be maintained while the bitrate is reduced.

The drawback of the deblocking filter is its high computational complexity. One of the most important pieces of information in the complexity analysis of a system is the distribution of time complexity among its major subsystems. In [], the authors report results averaged over all sequences in their test set: loop filtering ( %) and interpolation ( %) are the largest components, followed by bitstream parsing and entropy decoding ( %), and inverse transforms and reconstruction ( %). The deblocking filter is thus the most complex functional block of the decoder: it accounts for more than one-third of the computational complexity of the H.264/AVC decoder (Fig. ). Fast computation of the deblocking filter is therefore necessary for high-definition video processing.

Because of this complexity, the implementation of the H.264/AVC deblocking filter has been widely studied. The main source of complexity is that each pixel must be read several times, in different directions, to filter a complete macroblock. To deal with this problem, several processing orders were proposed in previous works, all of them aiming to reduce the computation time and the amount of memory used in the filtering process.

In this paper, we propose a new filtering order for the deblocking filter together with a new architectural design for this order. The architecture was described in VHDL and validated first in simulation and then on an FPGA device (using a co-design based approach). Finally, it was implemented targeting a nm low-power, low-voltage ASIC technology.

The paper is structured as follows. It first outlines the algorithm of the deblocking filter and then reviews the filter ordering solutions published in the literature. The proposed filtering order and its hardware architecture are presented next. The results are then reported and compared with related works, before the conclusion.

Deblocking Filter

In H.264/AVC, the deblocking filter is applied to all four edges of each block in a picture. In Fig., macroblocks are processed in raster scan order. For each macroblock, the vertical edges are filtered first, from left to right, and then the horizontal edges, from top to bottom.
As shown in Fig., the luma macroblock is first processed vertically, i.e. from g to j, and then horizontally, from k to n. The chroma components follow the same rule. The pixels lying on a straight line across two adjacent blocks, such as (p3, p2, p1, p0) and (q0, q1, q2, q3) in Fig. (a), are sent to the filter at the same time.

The H.264/AVC deblocking filter is highly adaptive. Several conditions determine:
1. Whether a block edge will be filtered or not.
2. The strength of the filtering for the block edges that will be filtered.

The Boundary Strength (BS) parameter, the α and β thresholds, and the values of the pixels at the edge determine the outcomes of these conditions. The BS parameter varies adaptively according to the quantization step size used when the block was coded, the coding mode of the neighboring blocks and the gradient of the pixel values across the edge being filtered []. Five strength levels exist (BS in [0, 4]): BS = 0 means no filtering and BS = 4 indicates maximum smoothing.

Figure illustrates the principle of the deblocking filter using a one-dimensional visualization of a block edge. In Fig., {q0, q1, q2, q3} are the pixels of the current block, whereas {p0, p1, p2, p3} are the pixels of the adjacent block, as detailed in Fig.. Whether the pixels p0 and q0, as well as p1 and q1, are filtered is determined by the Quantization Parameter (QP) and by the threshold variables α and β, which prevent true edges from being filtered. The values of α and β depend on QP. The filtering strength for an edge is determined by comparing pixel gradients with the α and β threshold values for that edge. Thus, filtering of p0 and q0 only takes place if the following content activity check is satisfied:

$$BS \neq 0 \;\text{and}\; |p_0 - q_0| < \alpha \;\text{and}\; |p_1 - p_0| < \beta \;\text{and}\; |q_1 - q_0| < \beta$$

Correspondingly, filtering of p1 or q1 occurs if the following condition is satisfied:

$$|p_2 - p_0| < \beta \;\text{and}\; |q_2 - q_0| < \beta$$

Figure Profiling of H.264/AVC decoder [].
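The two activity checks above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' VHDL: the function name `filter_decisions` and the list layout `p = [p0, p1, p2, p3]`, `q = [q0, q1, q2, q3]` are assumptions made for the example. It also assumes the per-side reading of the second check (p1 depends on |p2 - p0|, q1 on |q2 - q0|), which is how the H.264/AVC standard applies it.

```python
def filter_decisions(p, q, bs, alpha, beta):
    """Return (filter_edge, filter_p1, filter_q1) for one line of pixels.

    p = [p0, p1, p2, p3] and q = [q0, q1, q2, q3] (hypothetical layout),
    bs is the boundary strength, alpha/beta are the QP-dependent thresholds.
    """
    # Content activity check: p0 and q0 are filtered only if the step across
    # the edge is small enough to be a blocking artifact, not a true edge.
    filter_edge = (
        bs != 0
        and abs(p[0] - q[0]) < alpha
        and abs(p[1] - p[0]) < beta
        and abs(q[1] - q[0]) < beta
    )
    # Second check, evaluated per side (assumption: per-side reading).
    filter_p1 = filter_edge and abs(p[2] - p[0]) < beta
    filter_q1 = filter_edge and abs(q[2] - q[0]) < beta
    return filter_edge, filter_p1, filter_q1


if __name__ == "__main__":
    # Smooth region with a small step: everything is filtered.
    print(filter_decisions([100, 101, 102, 103], [104, 105, 106, 107], 2, 10, 4))
    # Large step (likely a true edge): nothing is filtered.
    print(filter_decisions([50, 50, 50, 50], [200, 200, 200, 200], 2, 10, 4))
```

The threshold values used in the example calls are arbitrary; in a real codec α and β come from QP-indexed tables.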
The dependency of α and β on the QP links the strength of filtering to the general quality of the reconstructed picture prior to filtering. The basic idea is that a relatively large absolute difference between samples near a block edge is quite likely to be a blocking artifact and should therefore be reduced. However, if the magnitude of that difference is so large that it can no longer be explained by the coarseness of the QP used in the encoding, the edge is more likely to reflect the actual behavior of the source picture and should not be smoothed over. The next paragraphs present the two variations of the deblocking algorithm according to the BS value.

Figure Vertical and horizontal edges in one macroblock (panels a and b; luma and chroma components).

Figure Principle of a block edge deblocking filtering (Block P: p3, p2, p1, p0; Block Q: q0, q1, q2, q3).

Algorithm for 0 < BS < 4

To calculate the new values of p0 and q0, the parameter Dif_0 is computed:

$$\mathrm{Dif}_0 = \mathrm{Clip}\big(-c,\; c,\; (((q_0 - p_0) \ll 2) + (p_1 - q_1) + 4) \gg 3\big)$$

The parameter c used by the Clip function is defined by the H.264/AVC standard (clip table), as shown in Table []. The updated values of p0 and q0 (named p'0 and q'0) are then computed as:

$$p'_0 = \mathrm{Clip}(p_0 + \mathrm{Dif}_0)$$
$$q'_0 = \mathrm{Clip}(q_0 - \mathrm{Dif}_0)$$

The computation of p1 and q1 occurs in the same manner. First, the values of Dif_p and Dif_q are determined:

$$\mathrm{Dif}_p = \mathrm{Clip}\big(-c_0,\; c_0,\; (p_2 + ((p_0 + q_0 + 1) \gg 1) - (p_1 \ll 1)) \gg 1\big)$$
$$\mathrm{Dif}_q = \mathrm{Clip}\big(-c_0,\; c_0,\; (q_2 + ((p_0 + q_0 + 1) \gg 1) - (q_1 \ll 1)) \gg 1\big)$$

After that, p'1 and q'1 are respectively given by:

$$p'_1 = p_1 + \mathrm{Dif}_p$$
$$q'_1 = q_1 + \mathrm{Dif}_q$$

Algorithm for BS = 4

Considering the current block (Q) and the previous block (P), the filtered pixels are computed with the following equations:

$$q'_0 = (p_1 + 2p_0 + 2q_0 + 2q_1 + q_2 + 4) \gg 3$$
$$q'_1 = (p_0 + q_0 + q_1 + q_2 + 2) \gg 2$$
$$q'_2 = (2q_3 + 3q_2 + q_1 + q_0 + p_0 + 4) \gg 3$$
$$p'_0 = (q_1 + 2q_0 + 2p_0 + 2p_1 + p_2 + 4) \gg 3$$
$$p'_1 = (q_0 + p_0 + p_1 + p_2 + 2) \gg 2$$
$$p'_2 = (2p_3 + 3p_2 + p_1 + p_0 + q_0 + 4) \gg 3$$

For chrominance blocks, the following equations must be adopted:

$$q'_0 = (2q_1 + q_0 + p_1 + 2) \gg 2$$
$$p'_0 = (2p_1 + p_0 + q_1 + 2) \gg 2$$
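The two filtering modes above translate almost directly into integer arithmetic. The following Python sketch is a software model written for illustration, not the authors' hardware description. The function names, the list layout `p = [p0, p1, p2, p3]` / `q = [q0, q1, q2, q3]`, the 8-bit pixel range used by the one-argument Clip, and the explicit `filter_p1` / `filter_q1` flags are all assumptions of this example; `c` and `c0` are passed in as plain parameters rather than looked up in the clip table.

```python
def clip3(lo, hi, x):
    """Three-argument Clip: saturate x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def clip_pixel(x, bit_depth=8):
    """One-argument Clip: saturate to the pixel range (8-bit assumed)."""
    return clip3(0, (1 << bit_depth) - 1, x)


def filter_normal(p, q, c, c0, filter_p1=True, filter_q1=True):
    """Filtering for 0 < BS < 4 on one line of pixels."""
    p0, p1, p2 = p[0], p[1], p[2]
    q0, q1, q2 = q[0], q[1], q[2]
    new_p, new_q = list(p), list(q)
    dif0 = clip3(-c, c, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    new_p[0] = clip_pixel(p0 + dif0)
    new_q[0] = clip_pixel(q0 - dif0)
    avg = (p0 + q0 + 1) >> 1
    if filter_p1:
        new_p[1] = p1 + clip3(-c0, c0, (p2 + avg - (p1 << 1)) >> 1)
    if filter_q1:
        new_q[1] = q1 + clip3(-c0, c0, (q2 + avg - (q1 << 1)) >> 1)
    return new_p, new_q


def filter_strong_luma(p, q):
    """Filtering for BS = 4 on one line of luminance pixels."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    new_p = [
        (q1 + 2 * q0 + 2 * p0 + 2 * p1 + p2 + 4) >> 3,
        (q0 + p0 + p1 + p2 + 2) >> 2,
        (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3,
        p3,
    ]
    new_q = [
        (p1 + 2 * p0 + 2 * q0 + 2 * q1 + q2 + 4) >> 3,
        (p0 + q0 + q1 + q2 + 2) >> 2,
        (2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3,
        q3,
    ]
    return new_p, new_q


def filter_strong_chroma(p, q):
    """Filtering for BS = 4 on one line of chrominance pixels."""
    new_p, new_q = list(p), list(q)
    new_p[0] = (2 * p[1] + p[0] + q[1] + 2) >> 2
    new_q[0] = (2 * q[1] + q[0] + p[1] + 2) >> 2
    return new_p, new_q


if __name__ == "__main__":
    p, q = [100] * 4, [108] * 4  # flat blocks with a step of 8 at the edge
    print(filter_normal(p, q, c=4, c0=2))
    print(filter_strong_luma(p, q))
    print(filter_strong_chroma(p, q))
```

Python's `>>` on negative integers is an arithmetic shift, which matches the behavior the equations rely on. On the flat test blocks, the BS = 4 filter spreads the step of 8 over three pixels on each side, whereas the weaker filter only touches two.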

Table Value of filter clipping variable c as a function of index A and BS [].

Figure Rules of the edge filtering order (luma and chroma components).

Related Works

To filter a macroblock, the value of a pixel must be read multiple times, and the intermediate filtering results are stored in a local memory because the following computation steps use them. To improve the use of the local memory and the filtering performance, the filtering operations must be reordered so that the intermediate results are used sooner. The only restriction imposed by the standard on the processing order is that all the horizontal filtering that uses a given sample must occur before the vertical filtering that uses this sample. The computation order imposed by the standard is illustrated in Fig..

The processing order proposed by the H.264/AVC standard [] is presented in Fig.. All the vertical borders of the luminance and chrominance blocks are filtered before the horizontal borders. Since the results of filtering the vertical borders are used when filtering the horizontal borders, all intermediate results must be stored. Consequently, this processing order is expensive in terms of memory usage and execution time. Indeed, it requires the storage of bytes ( luminance blocks and blocks for each chrominance) until the horizontal filtering occurs.

Figure Original H.264/AVC filtering order [].

The filtering order proposed by G. Khurana [], presented in Fig., alternates between horizontal and vertical filtering of the blocks. This solution decreases the local memory size, as just one line of blocks must be stored for use by the next filtering steps. When the pixels are completely filtered (i.e. in both directions), they can be written back to the main memory in order to be displayed or used as a reference in the future.

Figure Filtering order proposed in [].

The proposal of He Jing [], presented in Fig., relies on both data reuse and concurrent processing (using multiple filtering cores) to increase the design throughput. This architecture exploits a parallel filtering order in which two edge filters process the vertical and the horizontal edges simultaneously. Repeated numbers in Fig. correspond to the edge filterings that are executed in parallel, during the same clock cycle, on the two distinct filtering cores.

Figure Filtering order proposed in [].

The processing order proposed in [] and shown in Fig. also significantly reduces the number of clock cycles required to process a macroblock. This solution is based on the parallel execution of horizontal and vertical filtering computations. With the proposed computation schedule, up to three filtering cores can be used to speed up the data processing. The number of concurrent filterings is limited by data dependencies between macroblocks. With the filtering schedule proposed in [], up to four edge filters are possible. The order of the edge filtering process is provided in Fig..
In fact, the vertical edges of the first sub-block row of a MB, that is, the edges numbered as 0 in Fig., are processed successively to reuse the content data as efficiently as possible. After the left and right vertical edges of a sub-block have been filtered, the sub-block data are transposed and then transferred to the second stage of the pair, that is, the vertical filtering process, which performs deblocking filtering on horizontal edges.

Figure Filtering order proposed in [].

Figure Filtering order proposed in [].

Solution Based on Six Edge-Filter Units

A Filtering Order for Up to Six Parallel Computations

According to the restriction imposed by the H.264/AVC standard, it is possible to perform three or more concurrent filterings in the same macroblock without a significant increase of the local memory size. All the processing orders presented before operate at the block level: the filtering of a block edge is performed serially by the same filter, and the border of a block can be filtered only after the LOPs (Lines of Pixels) of the previous (left) block have been filtered (with a certain parallelism of computation, while respecting the constraints imposed by the H.264/AVC standard).

The architecture proposed in this paper is based on a new processing order and a dedicated local memory organization. Moreover, since the deblocking filter for chrominance pixels is almost identical to the one for luminance pixels, the data path can be shared, which minimizes the idle cycles of the edge filter. Our sample-oriented processing order allows a more effective use of the architecture parallelism without significantly increasing the local memory size. Figure shows the proposed filtering order. This processing order produces the same functional results as the order specified in the H.264/AVC standard []. With this processing order, up to six filterings may occur in parallel, which increases the throughput when the architecture is composed of six filter cores, as detailed below.

Figure Proposed filtering order.

Hardware Architecture Based on Six Edge Filter Units

Based on the proposed edge filter scheduling, we have designed a dedicated architecture composed of six filter units. The architecture is shown in Fig.. It exploits six identical filter units to enhance the processing throughput. Three edge filter units are dedicated to the horizontal edges and the three others are dedicated to the vertical ones.
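To give a feel for what six parallel units can buy, the short Python sketch below counts the block edges of one macroblock and derives an idealized lower bound on the number of filtering steps. It is only a back-of-the-envelope model: it assumes 4:2:0 sampling (a 16×16 luminance block and two 8×8 chrominance blocks), 4×4 sub-blocks that each contribute one left and one top edge, and it ignores data dependencies, transposition and I/O. The function names are hypothetical, and the result is not the cycle count reported for the proposed architecture.

```python
import math


def edges_per_macroblock(luma_size=16, chroma_size=8, block=4):
    """Count vertical and horizontal 4x4 block edges in one macroblock.

    Assumes 4:2:0 sampling: one luma plane plus two chroma planes.
    Each block contributes its left edge (vertical) and top edge (horizontal).
    """
    luma_blocks = (luma_size // block) ** 2        # 16 blocks
    chroma_blocks = (chroma_size // block) ** 2    # 4 blocks per chroma plane
    vertical = luma_blocks + 2 * chroma_blocks
    horizontal = luma_blocks + 2 * chroma_blocks
    return vertical, horizontal


def min_filtering_steps(vertical, horizontal, v_units=3, h_units=3):
    """Idealized lower bound: every unit is busy at every step."""
    return max(math.ceil(vertical / v_units), math.ceil(horizontal / h_units))


if __name__ == "__main__":
    v, h = edges_per_macroblock()
    print(v, h)                                  # edges of each orientation
    print(min_filtering_steps(v, h))             # three units per orientation
    print(min_filtering_steps(v, h, 1, 1))       # one unit per orientation
```

Under these assumptions a macroblock has 24 edges of each orientation, so three units per orientation cannot finish in fewer than 8 steps, against 24 steps for a single unit per orientation. The real schedule needs additional steps wherever an edge must wait for its neighbors.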
Figure Proposed deblocking filter architecture (input bus, memory control, local RAMs for luminance and for chrominance Cb and Cr, BS generator fed by the coding information QPp, QPq, OffsetA and OffsetB, filter control, FIFO memories, temporal buffers, FV and FH filter units, T and T inv transpose units, output bus).

Table Input and output from the transpose module (data input, data output).

Figure Edge computations scheduling units (block cycles of the HF and VF units, with their P and Q inputs).

This filter organization allows the horizontal filtering of vertical edges and the vertical filtering of horizontal edges to be computed in parallel. A BS computation module, one threshold calculator module, one c calculator module, transpose modules, six bit FIFO memories and thirteen bit temporal buffers compose the rest of the architecture.

Edge filter computations were scheduled and bound to the filtering units. Scheduling and binding were carried out under two main constraints: simplifying the local memory accesses and providing the best usage rate of the filtering units. With the proposed scheduling, the architecture can start executing the schedule once all the pixel data and the information required for the BS computations have been received. This choice simplifies the synchronization of the I/O and computation tasks, which are executed in a pipelined manner.

Figure summarizes the filtering process within the proposed architecture in terms of block cycles. Each block cycle requires clock cycles (this corresponds to the execution time of each block). The processing starts with the horizontal filtering. On the first block cycle, the input pixels to be filtered, [p0, p3] and [q0, q3], are fetched to the appropriate V-edge filters (the three HF units) from the left_luma_mem and the line_luma_mem, the left_chromau_mem and the line_chromau_mem, and the left_chromav_mem and the line_chromav_mem respectively. The vertical edge (L0~block0), the vertical edge (L~block) and the vertical edge (L~block0) are filtered simultaneously. Then the blocks L0, L, L are transferred to the write stage and written into the filtered memories. The partially filtered block0 and block are forwarded directly to the appropriate V-edge filter again (through a FIFO memory). Block0 is transferred to the appropriate bit temporal buffer in order to be used in the filtering of the edge between block0 and block. The blocks, L,, are loaded simultaneously on the next clock cycles and the edges block0~block, block~block, block~block are filtered. The block0, the block (vertically filtered), the block T and the block T are sent to the transpose register so that they can be used in the vertical filtering of the edges T~block0 and T~block (with the suitable VF filters), while the block and block are forwarded to the suitable V-edge filters for the filtering of the edges L~block and block~block. This process repeats until all edges are filtered, using either the bit FIFO memories or the bit temporal buffers.

Figure Local memory organization (Left_luma_mem, Top_luma_mem and line_luma_mem banks for luminance; Left_chroma_mem, Top_chroma_mem and line_chroma_mem banks for chrominance).

Figure I/O and filtering execution sequences using bit I/O interfaces.

To allow such edge filter scheduling, a dedicated memory binding of the data has been developed. Figure shows the memory organization that allows the computation scheduling to run without memory access conflicts. To guarantee that all transfers can be performed in one clock cycle without access conflict, the architecture is composed of local memory banks (each line of blocks, for luminance or chrominance pixels, is held in independent bit and bit memories respectively). The loop-filter architecture is linked to the rest of the system through two buses: one dedicated to input data and another one to output data. The bus widths are bits.
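The transposition step that sits between the horizontal and the vertical filtering passes can be modeled in software as follows. This is a behavioral sketch only, assuming 4×4 blocks delivered one line of pixels (LOP) at a time; the function `transpose_block` and the class `TransposeUnit` are hypothetical names introduced for the example and do not describe the internal structure of the T and T inv modules.

```python
def transpose_block(block):
    """Swap rows and columns of a square block given as a list of rows."""
    return [list(column) for column in zip(*block)]


class TransposeUnit:
    """Behavioral model: accepts one LOP per call, emits the transposed
    block once all lines of the block have been received."""

    def __init__(self, size=4):
        self.size = size
        self.rows = []

    def push(self, lop):
        if len(lop) != self.size:
            raise ValueError("a LOP must contain exactly %d pixels" % self.size)
        self.rows.append(list(lop))
        if len(self.rows) < self.size:
            return None                # block not complete yet
        block, self.rows = self.rows, []
        return transpose_block(block)  # rows become columns


if __name__ == "__main__":
    block = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    unit = TransposeUnit()
    for lop in block:
        result = unit.push(lop)
    print(result)
    # Applying the transposition twice restores the original block,
    # which is the role played by the inverse unit after filtering.
    print(transpose_block(result) == block)
```

Because a transposition is its own inverse, the same operation serves both before and after the filter; only the direction of the data flow differs.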
Data provided by the system are stored in the local memory banks according to the memory binding presented in Fig.. The T and T inv units transpose a block of pixels from rows to columns and from columns to rows respectively. Because the proposed architecture performs both horizontal and vertical filtering of block edges with the same filter, the pixels of each block must be transposed before and after the deblocking filter. The implemented T unit completes the transpose operation of a block in clock cycles (one LOP is received in each clock cycle). Table presents the operations performed by this module for one block. The input of this block is {a_i0, a_i1, a_i2, a_i3} with i in {0, 1, 2, 3} and the transposed output is {a_0i, a_1i, a_2i, a_3i} with i in {0, 1, 2, 3}.

Table Comparison with other designs. Columns: [], [], [], [], [], [], [], proposed architecture. Rows: Technology (μm, nm); Application target (fps); Working frequency (MHz) (a); Gates count (KGates); Number of filter cores; Memory (byte); Processing time (cycles/MB); Maximum throughput (KMB/s) (b).
(a) Corresponds to the frequency required to process the appropriate application target.
(b) Throughput (KMB/s) = ((/Fmax) processing time).

The control filter module is a finite state machine responsible for the synchronization of all data transfers (memory read/write and input/output interfacing), so that the filter module constantly processes new values. The filtering cores perform the filtering operations using the samples and the previously computed values of BS, the thresholds (α and β) and the c value. The BS calculator computes the filtering strength, and the threshold calculator defines the values of α and β from the quantization parameters of the two blocks being filtered. The c calculator is a module that, based on the filtering strength and on the threshold values, generates the clipping value used in the filtering process.

The proposed architecture allows a full pipeline of the I/O task with the computation task, as in []: once the data from the shared memory banks have been consumed by the computation units, the design can start loading the data of the next filtering. The data generated by the filter cores are stored in bit local memories (temporal buffers) or in bit first-in first-out (FIFO) memories, which hold the intermediate data used in the subsequent computation while data are read from another one. In the same way, once the computations have completed the filtering of a block, the results are immediately sent to the system. Figure provides an overview of the design behavior.

The time required to fill the input memory banks depends on the input bus width: the number of clock cycles required to receive the 0 pixel data from the system changes with it. To enable full-speed processing, bit data interfaces are required. Using such a width reduces the data loading stage to 0 data / data per cycle = 0 cycles. The time required for data loading is then lower than the execution time.

Implementation and Performance Results

We have designed the hardware architecture in VHDL at the RTL level and synthesized it with the Design Compiler tool from Synopsys.
In addition, to prove the correct behavior of the architecture on silicon, a co-design based implementation of the architecture was realized on an FPGA target. The architecture was first validated in simulation using Modelsim. A VHDL testbench was used to send pixel data to the deblocking filter architecture and to store the computation results. Input data were extracted from real video streams using the JM decoder tool []. The results generated by the architecture were compared with the JM decoded ones.

Figure Throughput comparison (KMB/s) between [], [], [], [], [], [], [] and the proposed design.

In a second step, we implemented the architecture in a Virtex- FPGA from Xilinx (M board). The JM decoder was executed on the PowerPC core of the FPGA and the deblocking filter was implemented as an accelerator. The communication was realized using a PLB bus. The JM tool was modified (1) to execute the loop filter computations, (2) to send/receive the data to/from the coprocessor and (3) to check the bit equivalence of the software and hardware results. The videos used in this experiment were stored on a compact flash device.

As previously explained, the architecture proposed in this paper relies on a new filter ordering and its consequent algorithm. The number of cycles required to filter a complete macroblock with each filtering order has therefore been analyzed. As shown in Table, the proposed filtering order performs the whole filtering of a macroblock in clock cycles (the filtering of each edge during a step takes clock cycles). Figure shows the area profiling of the proposed work when targeting an operating frequency of 0 MHz. The proposed design improves on the performance of other works. Its area remains low compared with other high-performance architectures. However, the proposed solution requires more memory bytes.

Table Required frequency for video standards (application target, frequency in MHz).

Figure Hardware complexity profiling.
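The relation between cycles per macroblock, frame format and required clock frequency can be made concrete with a small calculation. The Python sketch below is illustrative only: it assumes 16×16 macroblocks, that the filter needs a fixed number of cycles per macroblock, and that throughput equals the clock frequency divided by that cycle count. The function names and the value of 100 cycles per macroblock used in the example are hypothetical and are not taken from the tables of this paper.

```python
import math


def macroblocks_per_frame(width, height, mb_size=16):
    """Number of macroblocks needed to cover a frame (16x16 assumed)."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)


def required_frequency_mhz(width, height, fps, cycles_per_mb):
    """Clock frequency (MHz) needed to filter every macroblock in real time."""
    return macroblocks_per_frame(width, height) * fps * cycles_per_mb / 1e6


def throughput_kmb_per_s(fmax_mhz, cycles_per_mb):
    """Throughput in thousands of macroblocks per second at a given clock."""
    return fmax_mhz * 1e6 / cycles_per_mb / 1e3


if __name__ == "__main__":
    cycles = 100  # hypothetical cycles per macroblock, for illustration only
    print(macroblocks_per_frame(1920, 1080))
    print(required_frequency_mhz(1920, 1080, 30, cycles))
    print(throughput_kmb_per_s(100, cycles))
```

Halving the number of cycles per macroblock halves the required frequency for a given format, which is why reducing the cycle count matters as much as raising the maximum clock.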

The proposed architecture is faster than the other ones in the literature. It achieves a higher throughput at an identical clock frequency, or requires a lower frequency for an identical throughput. The deblocking filter is a system bottleneck in terms of processing cycles. With the proposed architecture, the processing cycles are greatly reduced (only clock cycles) and the system throughput is improved: the number of clock cycles per macroblock is reduced by, % ~ 0 %. The synthesis result shows that the proposed design takes .0 kgates, which is lower than previous approaches [,, ]. However, this area result must be put into perspective because the design consumes more memory bytes. The designs presented in Table include several filter cores executing in parallel. When looking at the work proposed in [], the total cost of that design and of ours are comparable, although a different memory organization is employed in our architecture. Thus, the proposed design accelerates the computation at a reasonable area cost.

Since the proposed architecture has six edge filters, a significant issue in designing the controller is to exploit these filters almost fully. The design in [] reduces the gate count substantially because it filters the MB with pipelined computation. However, its smaller buffer requires more frequent accesses to external memory, which leads to larger power consumption. The proposed design, in contrast, contains local memory modules in order to take advantage of parallel computing.

Figure compares the throughput achieved by this work with that of some previous works. The proposed design achieves four times the real-time performance requirement of the recent design []. Similarly, compared with [, ], the throughput of the proposed design is as much as three times higher.
In conclusion, compared with [,, ], our design achieves the highest throughput thanks to the lowest number of processing cycles and a relatively high working frequency, at the cost of a slight increase in the final area of the architecture due to the use of six filtering cores when registers are used in place of the memory blocks. Fig. shows that we can reach the same throughput as [] with a lower frequency. In addition, the proposed work provides an effective trade-off between hardware complexity and processing capability. Our deblocking filter is able to support real-time video applications of at 0 fps with low frequency requirements, thanks to the small number of clock cycles required to generate a filtered macroblock. Table presents the frequency required to process several application targets, based on the results of the proposed design. In this manner, the proposed deblocking filter architecture (producing lower dynamic power consumption) can be employed as an IP core in either a dedicated or a platform-based H.264/AVC codec system.

Conclusion

This paper presents a new hardware approach for implementing the H.264/AVC deblocking filter. The presented architecture is based on a new processing order with a new memory organization. The solution provides an efficient filtering order with its associated algorithm and achieves a better throughput than other works. This hardware implementation is designed to be used as part of an H.264/AVC video decoder or encoder. It benefits from several components executing in parallel. It meets real-time constraints and enables better efficiency in video coding or decoding (the H.264/AVC deblocking filter can be used either in the decoder or in the encoder).

Acknowledgments This study was carried out for the RTELI project and funded by the French SYSTEM@TIC ICT cluster [].

References

ISO/IEC MPEG and ITU-T (00). AVC Draft ITU-T ISO/IEC Recommendation and final draft international standard of joint video specification. ISO/IEC and ITU-T.
Richardson, I. E. (August 00). H.264 and MPEG-4 video compression (0 pages). England edition. Wiley & Sons.
Wiegand, T., Sullivan, G. J., Bjontegaard, G., & Luthra, A. (00). Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, (), 0.
Horowitz, M., Joch, A., Kossentini, F., & Hallapuro, A. (July 00). H.264/AVC baseline profile decoder complexity analysis. IEEE Transactions on Circuits and Systems for Video Technology, ().
Khurana, G., Kassim, T., Chua, T., & Mi, M. (00). A pipelined hardware implementation of in-loop deblocking filter in H.264/AVC. IEEE Transactions on Consumer Electronics, (), 0.
Jing, H., Yan, H., & Xinyu, X. (September 00). An efficient architecture for deblocking filter in H.264/AVC. In Proceedings of the Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 0) (pp. ).
Chien, C. A., Chang, H. C., & Gue, J. I. (November 0 - December 00). A high throughput in-loop de-blocking filter supporting H.264/AVC BP/MP/HP video coding. In Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems (APCCAS 0) (pp. ).
Chen, K. H. (0). cycles-per-macro block deblocking filter accelerator for high-resolution H.264/AVC decoding. IET Circuits, Devices & Systems, (), 0.
ITU (00). H.264/AVC reference software decoder (v.).
Chen, C. M., & Chen, C. H. (00). Configurable VLSI architecture for deblocking filter in H.264/AVC. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (),.
Wei, H., Tao, L. I. N., & Zheng-hui, L. I. N. (00). Parallel processing architecture of H. adaptive deblocking filters. Journal of Zhejiang University, (), 0.
Tobajas, F., Callicó, G. M., Perez, P. A., de Armas, V., & Sarmiento, R. (00). An efficient double-filter hardware architecture for H.264/AVC deblocking filtering. IEEE Transactions on Consumer Electronics, (),.


Xu, K., & Choy, C. S. (00). Five-stage pipeline, 0 cycles/mb, single-port SRAM-based deblocking filter for H.264/AVC. IEEE Transactions on Circuits and Systems for Video Technology, (),.
Lin, Y. C., & Lin, Y. L. (00). A two-result-per-cycle deblocking filter architecture for QFHD H.264/AVC decoder. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (),.
SYSTEM@TIC ICT cluster. systematic-paris-region.org/

Moez Kthiri was born in Béja, Tunisia, in. He received his degree in Instrumentation and Communication from the Faculty of Science at Sfax, his master's degree in Electronic Engineering from the Sfax National Engineering School (ENIS), Tunisia, in 00 and his Ph.D. degree in electronics from the IMS laboratory, University of Bordeaux, in 0. He is currently an assistant professor at the Higher Institute of Applied Sciences and Technologies of Mateur (Tunisia). His research interests include digital signal processing, image and video coding with emphasis on H.264/AVC standards, and co-design implementation.

Patrice Kadionik received his ENSEIRB engineer diploma in and the Ph.D. degree in Instrumentation and Measurement from the University of Bordeaux, France, in. After having worked for years for the France Telecom group, he joined the IXL Laboratory of Microelectronics. He is currently an associate professor at the ENSEIRB School of Electrical Engineering. He teaches embedded system design, networks and System on Chip. His main research activities include System on Chip for video compression and for sensor networks, and FPGA testing.

Ahmed Ben Atitallah received his Dipl.-Ing and MS degrees in electronics from the National Engineering School of Sfax (ENIS) in 00 and 00, respectively, and his Ph.D. degree in electronics from the IMS laboratory, University of Bordeaux, in 00. He is currently an assistant professor at the Higher Institute of Electronic and Communication of Sfax (Tunisia). He teaches embedded system design and System on Chip. His main research activities are focused on image and video signal processing, hardware implementation and embedded systems.

Bertrand Le Gal was born in, in Lorient, France. He received his Ph.D. degree in information and engineering sciences and technologies from the Université de Bretagne Sud, Lorient, France, in 00 and the DEA (MS degree) in Electronics in 00. He is currently an associate professor in the IMS Laboratory, ENSEIRB Engineering School, Talence, France. His research focuses on system design, high-level synthesis, SoC design methodologies and security issues in embedded devices such as Virtual Component Protection (IPP).
