Fast frame memory access method for H.264/AVC

Tian Song 1a), Tomoyuki Kishida 2, and Takashi Shimamoto 1
1 Department of Computer Systems Engineering, Institute of Technology and Science, Graduate School of Engineering, Tokushima University, Minami-Josanjima 2-1, Tokushima City, 770-8506, Japan
2 Department of Electrical and Electronic Engineering, Graduate School of Engineering, Tokushima University, Minami-Josanjima 2-1, Tokushima City, 770-8506, Japan
a) tiansong@ee.tokushima-u.ac.jp

Abstract: This paper presents an efficient memory access interface architecture for the H.264/AVC encoder. In the implementation of an H.264/AVC encoder, bandwidth compression of the frame memory becomes a challenging issue due to several bandwidth-intensive coding tools, such as multiple-frame motion estimation, the deblocking filter, and INTRA mode decision. In this work, by analyzing the memory access patterns of each coding function module of H.264/AVC, an efficient memory access method for the Direct Memory Access (DMA) module is proposed. The proposed method carefully designs a memory mapping scheme to decrease the memory response delay. Simulation results show that over 50% of memory access cycles can be saved by the proposed method.

Keywords: H.264/AVC, VLSI, SDRAM, bandwidth compression

Classification: Integrated circuits

References
[1] Joint Video Team of ISO/IEC MPEG and ITU-T VCEG, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)," March 2003.
[2] U. Bayazit, L. Chen, and R. Rozploch, "A novel memory compression system for MPEG-2 decoders," Proc. IEEE Int. Conf. Consum. Electron. (ICCE), pp. 56-57, 1998.
[3] J. Tajime and Y. Miyamoto, "A frame memory compression method for H.264 decoders," IEICE General Conf., D-11-35, March 2006.
[4] P. Zhang, W. Gao, D. Wu, and D. Xie, "An efficient reference frame storage scheme for H.264 HDTV decoder," Proc. Int. Conf. Multimedia & Expo, pp. 361-364, July 2006.
[5] H. Kim and I. C. Park, "High-performance and low-power memory-interface architecture for video processing applications," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 11, pp. 1160-1170, Nov. 2001.
1 Introduction

H.264/AVC [1], which achieves high coding efficiency over a wide range of bit rates, is used in a variety of practical applications. H.264/AVC inherits the MC-DCT based hybrid structure that is also adopted by earlier standards: it employs inter-frame prediction and an integer DCT to reduce temporal and spatial redundancy. In addition to these traditional algorithms, H.264/AVC introduces several new coding tools that greatly improve the coding efficiency. Among these coding tools, the exhaustive precoding process named rate-distortion optimization (RDO) takes over 80% of the total computational complexity. The RDO process performs multiple-reference-frame motion estimation, 1/4-pixel-precision motion estimation, deblocking filtering, and INTRA mode coding with 13 prediction types. Along with the high coding gain, however, these new coding tools drastically increase the computational complexity as well as the memory bandwidth.

In a typical hardware implementation of an H.264/AVC encoder, reference frames are temporarily stored in external frame memory, commonly SDRAM. As macroblocks are encoded one by one, the coding function modules access the SDRAM to read the current and reference macroblock data for each macroblock. To realize real-time encoding for H.264/AVC applications at full HD resolution, about 4.6-5 GB/s of bandwidth is necessary; however, current DDR3 technology can achieve only 3.2 GB/s. With the increasing demands of H.264/AVC applications, the memory interface therefore becomes an important research issue.

Some approaches that compress the pixel data to cut down frame memory consumption have been proposed [2, 3]. However, these methods reduce memory consumption at the sacrifice of image quality, e.g., through a simple 5-bit quantization. A memory mapping approach that arranges the pixel data access for sub-pixel data has been proposed [4]; however, it increases memory consumption.
Another study proposed a memory address generation scheme to optimize the memory interface [5]; however, that method is not suitable for H.264/AVC. In this paper, considering the features of the frame memory access patterns of the function modules, we introduce a novel Direct Memory Access (DMA) scheme for H.264/AVC.

2 Features of memory access patterns and SDRAM commands

In a macroblock-order based encoding engine for H.264/AVC, many coding function modules require pixel data of the current or reference macroblocks from frame memory. These coding functions have to be performed on each macroblock one by one due to the correlation between adjacent macroblocks. Based on the data access features of each coding function module, we classify all memory access patterns into two groups: the ME group and the MC group. The motion estimation modules, including integer-pixel, sub-pixel, and multiple-frame motion estimation, always access the frame memory for reference pixels within a certain search range. We classify these memory access patterns into the ME group.

Fig. 1. Data access request patterns of ME and MC

Three other function modules, namely INTRA mode decision, motion compensation, and the deblocking filter, always access the macroblock to the left of or above the current coding macroblock to read reference data. We classify these memory access patterns into the MC group. The pixel data request patterns of the ME and MC groups are shown in Fig. 1. The ME modules perform motion estimation within a certain search range (typically ±16). After the motion estimation process for one macroblock, reference data for the next macroblock need to be read from frame memory. As shown in Fig. 1, the memory request pattern of ME typically covers four macroblocks, located to the right of or below the current search range. On the other hand, the memory request pattern of the MC modules always covers the pixel data of the current macroblock and of one encoded macroblock located to the left of or above the current macroblock. These required pixel data need to be read out from the SDRAM with no response delay.

A typical SDRAM access can be described as a sequence of commands. First, the bank and row address are issued, followed by the column address. Using the burst mode of the SDRAM, multiple words in the same row can be accessed in consecutive cycles. However, when the required data are stored in different rows, the row address command has to be reissued, which induces an access delay of several cycles. To avoid this memory response delay, all the required data have to be mapped into the same row; alternatively, if the required data are mapped to different banks, the row of the other bank can be activated in advance to conceal the response delay [5].

3 Proposed method

In this work, we propose a memory mapping method based on these memory access patterns to reduce the memory access delay.
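The row-hit versus row-miss behavior described in Section 2 can be sketched with a small cycle-count model. The model below is purely illustrative: the 4-cycle row-miss penalty and the one-word-per-cycle burst rate are assumptions for exposition, not the timings of any particular device.

```python
# Simplified SDRAM cycle model (illustrative; the 4-cycle row-miss
# penalty and the 1-word-per-cycle burst rate are assumed values).
PRECHARGE_ACTIVATE = 4  # assumed penalty when the open row must change

class SDRAMModel:
    def __init__(self, num_banks=4):
        self.open_row = [None] * num_banks  # row currently open per bank
        self.cycles = 0

    def read_burst(self, bank, row, words):
        if self.open_row[bank] != row:
            # Row miss: precharge and activate before the burst starts.
            # When the miss targets another bank, this activation can be
            # issued in advance to hide the penalty [5].
            self.cycles += PRECHARGE_ACTIVATE
            self.open_row[bank] = row
        self.cycles += words  # burst mode: one word per cycle

m = SDRAMModel()
m.read_burst(bank=0, row=10, words=16)   # cold start: row miss
m.read_burst(bank=0, row=10, words=16)   # same row: no extra delay
same_row_cycles = m.cycles               # 4 + 16 + 16 = 36
m.read_burst(bank=0, row=11, words=16)   # row change in the same bank
row_miss_cycles = m.cycles - same_row_cycles  # 4 + 16 = 20
```

The model shows why the mapping proposed below tries to keep each request either within one row or spread across banks whose rows can be opened ahead of time.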
In this work, a typical SDRAM with four banks, a 32-bit data width, and 256 words per row is used. As discussed in the previous section, pixel data of four adjacent macroblocks need to be read out from frame memory for the ME modules. The pixel
data of the current macroblock and of the upper coding macroblock also need to be read out for the MC modules. To realize delay-free access to these continuous pixel data, we propose the memory mapping method shown in Fig. 2.

Fig. 2. Proposed memory mapping method

As shown in Fig. 2, four consecutive macroblocks in each macroblock line are collected as a group. Consecutive groups within one line are mapped to different banks, and groups in adjacent lines are likewise mapped to different banks; this mapping realizes delay-free access. A0, B0, C0, and D0 in Fig. 2 indicate four macroblock groups, each containing four macroblocks, stored in different banks (Bank A, B, C, D). In this case, any group can be read out by the ME modules without access delay, because all four macroblocks of a group are located in the same row of one bank. In the case of the B1, D1, B2, D2 pattern, the four macroblocks are mapped to two different banks, so no access delay occurs. In the case of the A4, B4 pattern, the reference data can also be read out for ME without access delay, because the first two macroblocks (A4) are mapped in the same row and the second two macroblocks (B4) are mapped in a different bank. For the MC access pattern, a typical example such as the B3, C3, D3 pattern induces no access delay either, because B3 and D3 are in different banks, and C3 and D3 are also in different banks; even when C3 and D3 fall in the same bank, no delay occurs because they share the same row. The proposed memory mapping method is suitable for almost all hardware-oriented algorithms, except those with random search patterns.

4 Simulation Results

The proposed method is evaluated from the viewpoint of memory access cycle reduction. The DMA has to respond to the MC and ME modules respectively.
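Before turning to the measurements, the group-to-bank mapping of Fig. 2 can be sketched in software. The bank-assignment formula below is our assumption (the paper defines the mapping only pictorially); the assertions check the two delay-free properties claimed in Section 3: a four-macroblock ME request touches at most two groups, which lie in different banks, and the MC pattern of a macroblock plus its upper neighbour always spans two different banks.

```python
GROUP_W = 4    # four consecutive macroblocks form one group
NUM_BANKS = 4  # Bank A, B, C, D

def bank_of(mb_x, mb_y):
    """Bank holding macroblock (mb_x, mb_y).

    One plausible realization of the Fig. 2 mapping (assumed formula):
    consecutive groups in a macroblock line rotate through the banks,
    and odd lines are offset so vertically adjacent groups also differ.
    """
    group_x = mb_x // GROUP_W
    return (group_x + 2 * (mb_y % 2)) % NUM_BANKS

# ME pattern: four consecutive macroblocks span at most two groups;
# when they span two, the groups sit in different banks, so the second
# row can be activated in advance and no access delay occurs.
for y in range(8):
    for x in range(16):
        banks = {bank_of(x + i, y) for i in range(4)}
        assert len(banks) <= 2

# MC pattern: the current macroblock and its upper neighbour are
# always stored in different banks.
for y in range(1, 8):
    for x in range(16):
        assert bank_of(x, y) != bank_of(x, y - 1)
```

With this assignment, group A0 (macroblocks 0-3 of line 0) lands in bank 0, B0 in bank 1, and so on, matching the rotation depicted in Fig. 2.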
Due to the random data length of the ME module, the memory bandwidth
Table I. Access request reduction of the proposed method

                     QCIF    CIF     VGA     HDTV720p  HDTV1080i
  Previous mapping   1,282   5,334   16,422  49,672    113,146
  Proposed mapping   610     2,410   7,258   21,688    49,094
  Reduction rate (%) 52.4    54.8    55.8    56.3      56.6

reduction is difficult to evaluate directly. In this work, the average reduction in the number of SDRAM access requests is used to evaluate the memory bandwidth. A dummy ME module that emulates the frame memory access patterns of motion estimation and a dummy MC module that emulates the frame memory access patterns of INTRA mode decision and the deblocking filter are described in Verilog-HDL, together with a DMA module. Since the access timing and the number of requests depend on the motion estimation algorithm, random numbers are used for the number of macroblocks required by ME, issued at random intervals. The scheme in which all pixel data are stored in the same bank is defined as the previous mapping method. The comparison between the previous and proposed methods is shown in Table I. As shown in Table I, compared with the previous method, the proposed mapping method cuts down over 50% of the access requests. Furthermore, the proposed method realizes a stable cycle reduction rate at any bit rate.

5 Conclusion

In this paper, an efficient memory mapping method and an embedded SRAM method are proposed to realize efficient bandwidth compression for the H.264/AVC encoder. We analyzed the memory access patterns of each coding function module of H.264/AVC and classified the function modules into two groups, i.e., two access patterns. We then proposed an efficient memory mapping method that achieves delay-free memory access. Using this method, the proposed architecture saves over 50% of the access cycles compared with the previous method. The proposed method has been verified to be an efficient memory access method for typical H.264/AVC-dedicated hardware encoder implementations.
Cutting down the memory access frequency directly helps to realize efficient memory bandwidth compression. However, how much of the total bandwidth of H.264/AVC these two proposed methods can save depends on the implementation of the corresponding motion estimation, INTRA mode selection, and deblocking filter modules.