I J S A A, Volume 1, Issue 1, December 2011

VLSI Implementation for Basic ARPS Algorithm for Video Compression

Jayaprakash.P 1, A.Mallaiah 2, and T.Venkata Lakshmi 3
1 PG Student in ECE Dept., GEC, Gudlavalleru
2 Associate Professor in ECE Dept., GEC, Gudlavalleru
3 Associate Professor in ECE Dept., GEC, Gudlavalleru
e-mail: 1 jpsircrr@gmail.com, 2 tvlthota@gmail.com, 3 malli797@gmail.com

Abstract: In this paper, we propose an architecture for the Adaptive Rood Pattern Search (ARPS) algorithm. Digital video technology has grown steadily over the last decade. With the development of mobile communications and multimedia techniques in recent years, traditional audio service no longer satisfies the demand for better communications. New applications such as 3G mobile phone video communication, video conferencing, and video streaming on the web continuously push research in digital video coding forward. It is therefore necessary to develop an efficient method for estimating the motion field between two frames and to design an architecture for the motion estimation algorithm. Block matching techniques are generally used for motion estimation in video coding. This work introduces a VLSI architecture model for motion estimation and studies its impact on the quality of the overall estimation process. The various levels of parallelism present in current general-purpose architectures are used to efficiently increase the performance of software video coders. We present a high-performance hardware architecture for real-time computation of the sum of absolute differences (SAD) using a pipelined multi-level SAD calculator. We pay special attention to the memory access design in order to optimize both memory usage and memory bandwidth. The performance of the proposed design in terms of memory bandwidth, throughput, operating cycles, and hardware complexity is evaluated and compared with other architectures.
This work aims to develop an efficient architecture for the ARPS algorithm for motion estimation that reduces computational complexity, so that search speed improves without sacrificing motion estimation accuracy.

Keywords: ARPS, SAD.

I. INTRODUCTION

A. Overview of ARPS

The speed and accuracy of motion estimation (ME) algorithms depend on the size of the search pattern and the magnitude of the target motion vector (MV). Small search patterns are useful for detecting small motions but tend to get trapped in a local minimum when the motion is large. Large search patterns detect large motions easily but spend unnecessary search points when the motion is small. Hence it is desirable to use different search patterns according to the estimated motion behavior (in terms of the magnitude of motion) of the current block. This raises two issues:

1) How can the motion behavior of the current block be predetermined so that ME is performed efficiently?
2) What are the most suitable size and shape of the search pattern(s)?

Regarding the first issue, in most cases adjacent macroblocks (MBs) belonging to the same moving object have similar motions. Therefore the MV of the current MB can be reasonably predicted from the neighboring MVs in the spatial or temporal domain.

As for the second issue, two types of search patterns are used. The first is the Adaptive Rood Pattern (ARP) with an adjustable rood arm, whose length is determined dynamically for each MB according to the predicted motion behavior. Note that the ARP is used only once, at the beginning of each MB search. The objective is to find a good starting point for the remaining local search, so as to avoid unnecessary intermediate searches and to reduce the risk of being trapped in a local minimum along a long search path. Ideally, the starting point identified is as close to the global minimum as possible.
If so, a small fixed-size search pattern can complete the remaining local search quickly. Note also that this small search pattern is applied repeatedly, without restriction, until the final MV is found.

B. Prediction of the Target MV

To obtain an accurate MV prediction for the current block, two factors need to be considered: 1) the choice of the Region Of Support (ROS), which consists of the neighboring blocks whose MVs are used to calculate the predicted MV, and 2) the algorithm used to compute the predicted MV.

In the temporal domain, the block in the reference frame at the same position as the current block is a straightforward choice of temporal ROS candidate. Its neighboring blocks in the same reference frame can also be used for prediction. However, this requires a large amount of memory, because the MV information of the complete reference frame must be stored. Temporal prediction is therefore ruled out because of its memory and computation requirements.

The alternative is spatial prediction, which uses the already calculated MVs of the neighboring blocks as the source for prediction. This is a good option because it keeps the memory requirement low. The concept of Region of Support (ROS) is used for the prediction of the current block MV. Four types of ROS are possible, as shown in Fig. 1. Type A covers all four neighboring blocks. Type B is the prediction ROS adopted in some international standards, such as H.263, for the differential coding of MVs. Type C is composed of the two directly adjacent blocks, and Type D consists of only one adjacent block, the one to the left of the current MB. Experiments with the various types of ROS showed that they yield fairly similar results, with a difference of less than 0.1 dB in PSNR and 5% in the number of search points. Hence it is sensible to choose the Type D ROS, since it requires only one motion vector for prediction.

C. Selection of Search Patterns

1. Adaptive Pattern for Initial Search: The rood pattern is symmetrical, as shown in Fig. 2. The main structure of the ARP takes the rood shape; its size is the distance between the centre point and any of the other vertex points. The shape of the rood pattern was chosen on the basis of real-world motion sequences. For most sequences, the motion vector distribution was observed to lie mostly in the horizontal and vertical directions rather than in other directions, since camera movements are mostly in those directions. Because the rood pattern spreads in both the vertical and horizontal directions, it can detect such motion quickly and can jump directly into the local region of the global minimum. Secondly, any MV can be decomposed into one vertical component and one horizontal component. For a moving object that may introduce motion in any direction, the rood-shaped pattern can at least detect the major trend of the motion, which is the desired outcome of the initial search stage. Furthermore, the symmetric shape of the ARP is advantageous for hardware implementation.
As noted in the previous section, even though only a single MV is used for prediction, the resulting performance is comparable to that of the other ROS types, which cover more neighboring blocks. This shows that even when the predicted MV is not accurate, the rood-shaped pattern, which spreads in the horizontal and vertical directions, can still track the major direction of motion and hand over to the refinement process. In addition to the four arm points, it is useful to include the position of the predicted motion vector, since this allows the search to terminate in the initial stage when the predicted MV matches the target MV. In total there are therefore six search points in the initial search stage and then five search points for the further refinement process. The search pattern used in the initial search stage is shown in Fig. 2. In this method the Rood Arm Length (RAL) is equal to the magnitude of the predicted motion vector, and the four arms are of equal length. Mathematically, the size of the ARP, Γ, is

\[ \Gamma = \mathrm{Round}\left( \left| MV_{predicted} \right| \right) = \mathrm{Round}\left( \sqrt{ MV_{predicted}(x)^2 + MV_{predicted}(y)^2 } \right) \]

Fig 1: Types of Region of Support

Fig 2: Adaptive Rood Pattern

2. Fixed Pattern for Refined Local Search: In the initial search, the adaptive rood pattern leads directly to a new search position somewhere around the global minimum, which avoids unnecessary search points along an intermediate search path. Since the risk of being trapped in a local minimum is then small, a fixed pattern can be used to identify the global minimum. The minimum error point found in the first step becomes the centre of the fixed pattern in the second step. This process is repeated until the point of minimum error is the centre of the search pattern of the current iteration. Two types of fixed patterns were proposed. The first one was the 3×3 square pattern as was proposed in the SDSP.
The second pattern consists of a unit-size rood arm pattern. The reported experimental results showed that the 3×3 square pattern yields a PSNR similar to that of the unit rood arm pattern but needs 40% to 80% more search points. This demonstrates the efficiency of the unit rood arm pattern. The proposed fixed patterns are shown in Fig. 3.

Fig 3: Fixed-size Patterns
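As an illustration of the initial search stage described in this section, the following Python sketch computes the rood arm length from a predicted MV and lists the resulting search points. It is a minimal sketch, not the authors' implementation: the function names `rood_arm_length` and `initial_search_points`, the (dx, dy) tuple convention, the round-half-up rule, and the removal of coinciding points are all assumptions made for the example.

```python
import math


def rood_arm_length(mv_pred):
    """Arm length as the rounded Euclidean magnitude of the predicted MV."""
    x, y = mv_pred
    # Round half up (assumption); Python's built-in round() rounds half to even.
    return int(math.floor(math.hypot(x, y) + 0.5))


def initial_search_points(mv_pred):
    """Distinct (dx, dy) offsets checked in the initial stage:
    the centre, the four rood-arm tips, and the predicted MV position."""
    arm = rood_arm_length(mv_pred)
    candidates = [(0, 0), (arm, 0), (-arm, 0), (0, arm), (0, -arm),
                  tuple(mv_pred)]
    points = []
    for p in candidates:
        if p not in points:  # drop points that coincide
            points.append(p)
    return points
```

For a predicted MV of (3, 4) the arm length is 5 and six distinct points are returned. When the predicted MV falls on a rood-arm tip, as for (2, 0), the duplicate collapses and only five points remain.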

D. Summary of the ARPS method

Step 1: Compute the matching error (SAD_centre) between the current block and the block at the same location in the reference frame (i.e., the centre of the current search window). If the current block is a leftmost block, set

\[ \Gamma = 2 \]

otherwise set

\[ \Gamma = \max\left( \left| MV_{predicted}(x) \right|, \left| MV_{predicted}(y) \right| \right) \]

Go to Step 2.

Step 2: Align the centre of the ARP with the centre point of the search window, and check its four points and the position of the predicted motion vector to find the minimum error point.

Step 3: Set the centre point of the unit-size rood pattern at the minimum error point found in the previous step and check its points. If the new minimum error point is not at the centre of the unit rood pattern, repeat this step; otherwise, the MV is the one corresponding to the minimum error point of the current step.

II. ADAPTIVE ROOD PATTERN SEARCH (ARPS)

A. Modules in the proposed architecture

There are nine main modules in the proposed architecture:
1. Current and Reference video frame storage
2. Initial search processing unit
3. Refined search processing unit
4. Look-up table
5. Motion Predictor storage
6. Reference Block data register
7. Current Block data register
8. Pipelined multi-level SAD calculator
9. SAD comparator and Motion Vector generator

The block diagram of the proposed architecture is shown in Fig. 4. These modules work together to generate a motion vector for every macroblock in the video frames; these motion vectors are the outputs that enable video compression.

In the architecture of the ARPS motion estimator, the motion vector search has two main stages, the initial search and the refined search. In the initial search stage, the architecture uses the previously calculated motion vectors to produce a Motion Vector Predictor (MVP) for the current block. The initial search points are generated from the MVP and a Look-up Table (LUT), which together define the search range of the adaptive pattern.
After a Minimum Motion Estimation (MME) point is found in this stage, the search refinement takes effect: the square pattern is applied around the MME point iteratively to obtain the final best MME point, which indicates the final best MV for the current block. For motion estimation, the reference frames are stored in SRAM (or DRAM), while the current frame and the produced MVs are stored in dual-port memory (BRAM). The LUT also uses BRAM to facilitate the generation of the initial search points.

Fig 4: Block-level Architecture for ARPS motion estimation

The initial search processing unit (ISPU) generates the initial search points and then performs the initial motion search. To generate the initial search points, previously calculated MVs and an LUT are employed. The LUT contains the vertical and horizontal components of the initial search points. Both the produced MVs and the LUT values are stored in BRAM, so that they can be accessed in parallel through two independent data ports.

When the initial search stage is finished, the refined search processing unit (RSPU) is enabled. It applies the square pattern around the MME point derived in the initial search stage to refine the local motion search. The refinement step may be repeated a few times, until the MME point remains at the search centre.

The ARPS algorithm has low complexity, which makes it well suited to hardware implementation. The hardware architecture takes advantage of pipelined or parallel processing of the search pattern, and uses a fully pipelined SAD calculator to improve computational efficiency and thus lower the required clock rate. When a fixed square pattern is used to refine the MV search results, the mapping of the memory architecture is important for performance.
In our design, the memory architecture is mapped onto a 2-D register space for the refinement stage. The maximum size of this space is 16×16 at pixel bit depth, i.e., the mapped register memory can accommodate at most one 16×16 macroblock. A simple combination of parallel register shift operations and the related data fetches from SRAM reduces the memory bandwidth and thus speeds up the refinement processing, because much of the pixel data searched in this stage remains unchanged.
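The two-stage search that the ISPU and RSPU carry out, following Steps 1 to 3 of the ARPS summary, can be sketched in software as below. This is a hedged illustration rather than the proposed hardware: the names `sad` and `arps_search`, the list-of-lists frame layout, the `search_range` bound, and the caching of visited points are assumptions, and the refinement uses the unit rood pattern of Step 3 (the architecture text refers to a square pattern for the same stage).

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n current block at
    (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for r in range(n):
        for c in range(n):
            total += abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
    return total


def arps_search(cur, ref, bx, by, n, mv_pred, search_range, leftmost):
    """Return (mv, min_sad, points_checked) for one block."""
    height, width = len(ref), len(ref[0])
    cache = {}  # SAD of every displacement already evaluated

    def cost(dx, dy):
        # Reject candidates outside the search window or the frame.
        if max(abs(dx), abs(dy)) > search_range:
            return None
        if not (0 <= bx + dx <= width - n and 0 <= by + dy <= height - n):
            return None
        if (dx, dy) not in cache:
            cache[(dx, dy)] = sad(cur, ref, bx, by, dx, dy, n)
        return cache[(dx, dy)]

    # Step 1: centre SAD and rood arm length.
    best, best_sad = (0, 0), cost(0, 0)
    arm = 2 if leftmost else max(abs(mv_pred[0]), abs(mv_pred[1]))

    # Step 2: four rood-arm tips plus the predicted MV position
    # (assumption: a leftmost block has no predicted MV to check).
    initial = [(arm, 0), (-arm, 0), (0, arm), (0, -arm)]
    if not leftmost:
        initial.append(tuple(mv_pred))
    for p in initial:
        s = cost(*p)
        if s is not None and s < best_sad:
            best, best_sad = p, s

    # Step 3: unit rood pattern until the minimum stays at the centre.
    while True:
        centre = best
        for ox, oy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            p = (centre[0] + ox, centre[1] + oy)
            s = cost(*p)
            if s is not None and s < best_sad:
                best, best_sad = p, s
        if best == centre:
            return best, best_sad, len(cache)
```

The third return value counts distinct displacements evaluated, which corresponds to the "number of search points" metric used when fast block-matching algorithms are compared.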

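The data reuse that motivates the register-space mapping can be quantified with a short calculation. The sketch below is an assumption-laden illustration (the function name `unchanged_fraction` and the default block size of 16 are chosen for the example): it returns the fraction of a block window that overlaps its previous position after the window moves by a given offset.

```python
def unchanged_fraction(dx, dy, block=16):
    """Fraction of a block x block window that overlaps its previous
    position after the window moves by (dx, dy) pixels."""
    overlap_w = max(block - abs(dx), 0)
    overlap_h = max(block - abs(dy), 0)
    return (overlap_w * overlap_h) / (block * block)
```

For a 16×16 block, an offset of (1, 0) keeps 240 of 256 pixels (93.75%) and an offset of (1, 1) keeps 225 of 256 pixels (87.89%), so only a thin strip of new pixels has to be fetched from SRAM for each refinement step.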
1. Current and Reference video frame storage

This is one of the main input blocks of the architecture; it stores the current and reference video frames. The frames are stored in SRAM, referred to as the external memory, which holds both the reference (previous) frame data and the current (present) frame data. The frame data are stored in a 2-D arrangement, which makes both retrieval and motion vector calculation straightforward. Four counters, interconnected by a control circuit, are used to retrieve data from the memory unit:
1. Column counter (CC)
2. Row counter (RC)
3. Column-Block counter (CBC)
4. Row-Block counter (RBC)

The 2-D addresses of the current and reference frame data are obtained with the formula

\[ \mathrm{ADDRESS} = \left[ (RC + (RBC \times 16)) \times \mathrm{COLUMN\_SIZE} \right] + \left[ CC + (CBC \times \mathrm{MB\_SIZE}) \right] \]

To generate the addresses of one 16×16 macroblock, the column counter counts 256 times and the row counter counts 16 times, while the column-block counter and the row-block counter each count once.

2. Initial search processing unit

In the architecture of the ARPS motion estimator, the motion vector search has two main stages, the initial search and the refined search, as shown in the hardware block diagram. In the initial search stage, the architecture uses the previously calculated motion vectors to produce an MVP for the current block. The initial search points are generated from the MVP and the LUT, which together define the search range of the adaptive pattern. The initial search processing unit (ISPU) generates the initial search points and then performs the initial motion search. The LUT contains the vertical and horizontal components of the initial search points. Both the produced MVs and the LUT values are stored in BRAM, so that they can be accessed in parallel through two independent data ports.

3. Refined search processing unit

After an MME point is found in the initial search stage, the search refinement takes effect: the square pattern is applied around the MME point iteratively to obtain the final best MME point, which indicates the final best MV for the current block. The refined search processing unit (RSPU) is enabled when the initial search stage is finished. The refinement step may be repeated a few times, until the MME point remains at the search centre.

4. Look-up table (LUT)

A Look-Up Table (LUT) is used in the proposed architecture to determine the Rood Arm Length (RAL), which is equal to the magnitude of the predicted motion vector for the initial search stage; the four arms are of equal length. From the motion vectors of the previous macroblock (MB), the search points in the current MB for the initial search of the algorithm can be determined. Mathematically, the size of the ARP, Γ, is

\[ \Gamma = \sqrt{ (SP_x)^2 + (SP_y)^2 } \]

where SP_x and SP_y are the horizontal and vertical components of a search point on the axis. The values of SP_x and SP_y are predefined in the LUT according to the size of the rood arm, as listed in Table 1.

Table 1: Initial search points defined by LUT

5. Motion Predictor Storage

This module stores the motion vectors obtained for the current macroblock, which play an important role in generating the Rood Arm Length (RAL). The pel at the position of the previous macroblock's minimum search point is assumed to be the minimum search point, and its SAD value is calculated with respect to the centre of the current macroblock. After the minimum search point and the corresponding motion vector have been calculated, the SAD comparator sends the motion vector and the corresponding minimum SAD point to the Motion Predictor Storage. This module is controlled by the control unit, which generates the enable signals needed for its correct operation.
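The address formula given for the frame storage (module 1) can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the function names `pixel_address` and `macroblock_addresses` are hypothetical, the constant 16 in the row term is treated as the same value as MB_SIZE, and a frame width of 1280 pixels is used only as an example of COLUMN_SIZE.

```python
MB_SIZE = 16  # macroblock edge length in pixels (assumed equal to the literal 16)


def pixel_address(rc, cc, rbc, cbc, column_size):
    """Linear address of pixel (rc, cc) inside macroblock (rbc, cbc)
    for a frame stored row by row with column_size pixels per row."""
    return (rc + rbc * MB_SIZE) * column_size + (cc + cbc * MB_SIZE)


def macroblock_addresses(rbc, cbc, column_size):
    """Addresses of all 256 pixels of one 16x16 macroblock, row by row."""
    return [pixel_address(rc, cc, rbc, cbc, column_size)
            for rc in range(MB_SIZE)
            for cc in range(MB_SIZE)]
```

The inner loop over `cc` advances 256 times per macroblock and the outer loop over `rc` advances 16 times, which matches the counter behaviour described for module 1.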

6. Reference Block Data Register

When a square pattern is used to refine the MV search results, the mapping of the memory architecture is important for performance. In our design, the memory architecture is mapped onto a 2-D register space for the refinement stage. The maximum size of this space is 18×18 at pixel bit depth, that is, the mapped register memory can accommodate at most one 16×16 macroblock plus the edge redundancy needed for the rotated data shift and storage operations. A simple combination of parallel register shifts and the related data fetches from SRAM reduces the memory bandwidth and speeds up the refinement processing, because much of the pixel data searched in this stage remains unchanged. For example, 87.89% and 93.75% of the pixel data stay unchanged when the (1, 1) and (1, 0) offset searches for the 16×16 block are executed, respectively.

7. Current Block Data Register

This is a 16×16 register that stores one 16×16 macroblock of the current frame. The current-frame macroblock data retrieved from the external memory are stored here and sent to the pipelined SAD calculator, which calculates the SAD values within the search area. The current block data register is implemented in Block Random Access Memory (BRAM), which is organized in blocks so that memory usage can be controlled. Using BRAM saves a considerable amount of memory compared to SRAM. The same current macroblock data are reused for different reference macroblocks in both the initial and the refined searches within the search area to find the best matching block. For this reason, BRAM is suitable for the current block data register.

8. Pipelined multi-level SAD calculator

Since the main ME operations are SAD calculations, which have a critical impact on the performance of a hardware-based motion estimator, a fully pipelined SAD calculator is designed to speed up the SAD computations. The figure shows the basic architecture of the pipelined SAD calculator, which supports variable block sizes (VBS). According to the VBS indicated by the block shape and enable signals, the SAD calculator employs the appropriate parallel and pipelined adder operations to generate the SAD result for a searched block. With the parallel calculations of the basic processing unit (BPU), it can take 4 clock cycles to finish the SAD computation for a 4×4 block (the BPU handles 4×4 block SADs) and 8 clock cycles to produce the final SAD result for a 16×16 block. To support the VBS feature, different block shapes can be processed based on the prototype of the BPU. In this case, a 16×16 macroblock is divided into 16 basic 4×4 blocks, each of which receives the current and reference block pixel data in parallel.

9. SAD comparator and Motion Vector generator

The SAD comparator compares the previously generated block SAD results to obtain the final estimated MV, which corresponds to the best MME point, the one that has the minimum SAD with the lowest block pixel intensity. To select and compare the proper block SAD results, the signals for the different block shapes and computing stages are used to determine the appropriate mode of minimum SAD. For example, if the 16×16 block size is used for motion estimation, the 16×16 block data are loaded into the BPU for the SAD calculations. Each 16×16 block requires 4 computing stages to obtain the final block SAD result. In this case, the result mode of 16×8 or 16×16 SAD is selected first.
Meanwhile, the computing-stage signal is also used to indicate the valid input to the SAD comparator, so that the proper SAD results are retrieved from the BPU and the MME point with the minimum SAD for this block size is obtained. The best MME point position obtained by the SAD comparator is further used to produce the best matched reference block data and the residual data, which are important to other video encoding functions such as mathematical transforms and motion compensation. To maximize the computational throughput, the SAD memory is implemented on-chip as a dual-port RAM with concurrent management of the two ports, SADin and SADout.

Fig 5: SAD comparator and Motion Vector generator

III. RESULTS

A. Simulation Results

The proposed architecture has been simulated in MATLAB 7.2 and compared with the Diamond search, Three step search, and Full search algorithms. The comparison metrics were PSNR and the number of search points. The test frames are HDTV frames (720p and 1080p). The number of predicted frames is 14. The results are compared in Tables 2 to 4.
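Since PSNR is one of the two comparison metrics, a short Python sketch of how it is commonly computed for 8-bit frames may be helpful. The function name `psnr`, the list-of-lists frame representation, and the peak value of 255 are assumptions; the paper does not state how its MATLAB simulation computed the metric.

```python
import math


def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equally sized 2-D lists of pixel values."""
    count = 0
    squared_error = 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            squared_error += (o - r) ** 2
            count += 1
    if squared_error == 0:
        return math.inf  # identical frames
    mse = squared_error / count
    return 10 * math.log10(peak * peak / mse)
```

In a motion estimation experiment, `original` would be the current frame and `reconstructed` the motion-compensated prediction built from the estimated MVs, so a higher PSNR indicates a better match.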

Table 2: Comparison by average PSNR (in dB)

Table 3: Comparison by average number of search points per block

Table 4: Comparison by total encoding time and total ME time for the sequence. The input file used is 40.yuv.

B. Simulation Graph Results

(Input: parkrun.yuv) (Input: Foreman_CIF)

C. Hardware Simulation

The HDL used for the proposed architecture is Verilog. The synthesis tool is Xilinx XST, and the design environment is Xilinx ISE version 11.1.

Fig 6: Simulation Result of the Processing Element Architecture

IV. CONCLUSION AND FUTURE WORK

1. CONCLUSION

ARPS performed better in terms of search points than other block matching algorithms such as TSS, DS, and FS. The simulation results show that ARPS is better than TSS in terms of PSNR and performs almost the same as DS and FS. Using multi-port memories and sub-module memories speeds up the algorithm, i.e., the number of clock cycles is reduced. The FPGA device utilization summary shows that the motion estimation block uses very few resources, which makes it possible to develop the entire encoder on a single FPGA. The structural design of the VLSI architecture for the ARPS algorithm shows the actual complexity involved in motion estimation algorithms.

2. FUTURE WORK

The computationally intensive nature of most block matching algorithms and the demand for real-time processing make the VLSI implementation of motion estimation algorithms a necessity. For high-quality video, or in applications where a powerful processor is not available, a hardware implementation is the solution. In this work, a VLSI architecture for the ARPS algorithm is presented to compute the motion vectors required by the H.264/AVC video coding standard. The proposed architecture is easily scalable, and parallel implementations can be realized efficiently to obtain higher speed with a reasonable increase in hardware resources.
The architecture attains low power, low memory bandwidth, 100% hardware utilization, and high throughput while supporting all the block sizes specified by H.264. With slight modifications to the proposed architecture, real-time performance can be achieved. The architecture can be modified by using sub-module memories, which are useful for pipelining, and by increasing the number of processing elements and SAD calculators, so that it can be used in real-time video encoders targeted at HDTV applications. The proposed architecture is targeted at FPGAs; it can also be extended to an ASIC implementation.

V. REFERENCES

[1] MPEG, ISO CD 11172-2, "Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s," Nov. 1991.
[2] ISO/IEC 11172-2 (MPEG-1 Video), "Information technology: coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s: Video," 1993.
[3] ISO/IEC 13818-2 | ITU-T H.262 (MPEG-2 Video), "Information technology: generic coding of moving pictures and associated audio information: Video," 1995.
[4] CCITT SGXV, "Description of reference model 8 (RM8)," Document 525, Working Party XV/4, Specialists Group on Coding for Visual Telephony, Jun. 1989.
[5] Y. Nie and K.-K. Ma, "Adaptive Rood Pattern Search for fast block matching algorithms," IEEE Trans. on Image Processing, vol. 11, no. 12, pp. 1442-1449, December 2002.
[6] Zhongli He, Ming L. Liou, Philip C. H. Chan, and R. Li, "An Efficient VLSI Architecture for New Three-Step Search Algorithm," 38th Midwest Symposium on Circuits and Systems, August 1996, pp. 1228-1231.
[7] B.K.N.S. Rao, S.K. Chatterjee, and I. Chakrabarti, "Low Power VLSI Architecture for FTSS algorithm," International Conference on RF and Signal Processing Systems, Feb. 2008, pp. 286-291.
[8] Th. Zahariadis and D. Kalivas, "A Spiral Search Algorithm for Fast Estimation of Block Motion Vectors," Signal Processing VIII: Theories and Applications, Proceedings of EUSIPCO 96, Eighth European Signal Processing Conference, vol. 2, pp. 1079-82.
[9] Jaswant R. Jain and Anil K. Jain, "Displacement Measurement and Its Application in Interframe Image Coding," IEEE Transactions on Communications, vol. COM-29, no. 12, December 1981.
[10] A. N. Netravali and J. D. Robbins, "Motion compensated television coding: Part I," Bell Syst. Tech. J., vol. 58, pp. 631-670, Mar. 1979.
[11] Lai-Man Po and Wing-Chung Ma, "A Novel Four-Step Search Algorithm for Fast Block Motion Estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 313-317, June 1996.
[12] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion compensated interframe coding for video conferencing," in Proc. Nat. Telecommun. Conf., New Orleans, LA, Nov. 29-Dec. 3, 1981, pp. G5.3.1-G5.3.5.
[13] Jong-Nam Kim and Tae-Sun Choi, "A Fast Three Step Search Algorithm with Minimum Checking Points," Proc. of IEEE Conference on Consumer Electronics, pp. 132-133, 2-4 June 1998.
[14] Renxiang Li, Bing Zeng, and Ming L. Liou, "A New Three-Step Search Algorithm for Block Motion Estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 4, pp. 438-442, August 1994.