Speed-area optimized FPGA implementation for Full Search Block Matching

Santosh Ghosh and Avishek Saha
Department of Computer Science and Engineering, IIT Kharagpur, WB, India, 721302
{santosh, avishek}@cse.iitkgp.ernet.in

Abstract

This paper presents an FPGA-based hardware design for Full Search Block Matching (FSBM) based Motion Estimation (ME) in video compression. The significantly higher resolution of HDTV-based applications is typically handled using FSBM-based ME. The proposed architecture uses a modification of the Sum-of-Absolute-Differences (SAD) computation in FSBM such that the total number of addition/subtraction operations is drastically reduced. This simultaneously optimizes the conflicting design requirements of high throughput and small silicon area. Comparison results demonstrate the superior performance of our architecture. Finally, the design of a reconfigurable block matching hardware is discussed.

1. Introduction

Rapid growth in High-Definition (HD) digital video applications has led to an increased interest in portable HD-quality encoder design. An HD-compatible MPEG-2 MP@HL encoder uses the Full Search Block Matching Algorithm (FSBMA) for Motion Estimation (ME). The ME module accounts for more than 80% of the computational complexity of a typical video encoder. Moreover, the power consumption of an FSBM-based encoder is prohibitively high, particularly for portable implementations. Hence, efficient ME processor cores need to be designed to realize portable HDTV video encoders.

A parameterizable FSBM ASIC design that solves the input bandwidth problem using on-chip line buffers was proposed in [15]. [18] proposed a family of modular VLSI architectures which accept sequential inputs but perform parallel processing with 100 percent efficiency. A systolic mapping procedure to derive FSBM architectures was proposed in [4]. The designs of [2], [20] and [5] focused on the reduction of pin counts by sharing memory units and by two-dimensional data reuse, respectively. [19] improved the memory bandwidth by using an overlapped data flow of the search area, which increased processing element (PE) utilization. A low-latency, high-throughput tree architecture for FSBM was proposed in [3]. Both [13] and [1] proposed low-power architectures based on the removal of unnecessary computations. Finally, a novel low-power parallel tree FSBM architecture was proposed in [6], which exploits the spatial data correlations within parallel candidate block searches for data sharing and thus effectively reduces data access bandwidth and power consumption.

[7] proposed an FPGA architecture to implement parallel computation of FSBM. Systolic-array and On-Line Arithmetic (OLA) based designs for FSBM were proposed in [8] and [9], respectively. Customizable low-power FPGA cores were proposed in [10]. [11] evaluated the performance of the FSBM hardware architectures of [4] implemented on Xilinx FPGAs; the results show that real-time motion estimation for CIF (352x288) sequences can be achieved with 2-D systolic arrays and a moderate-capacity (250 k gates) FPGA chip. An adder-tree based 16x1 SAD FPGA hardware was implemented by [17].

The aforementioned FSBM architectures can be divided into two categories, namely FPGA [7, 8, 9, 10, 11, 17] and ASIC [4, 15, 18, 2, 3, 20, 5, 19, 13, 1, 6]. This work uses FPGA technology to implement a high-performance ME hardware with due consideration to (a) processing speed and (b) silicon area. Almost all of the aforementioned VLSI architectures optimize only one of these parameters. The novelty of the proposed architecture lies in its combined optimization of these conflicting design requirements. The proposed hardware uses an initially-split pipeline to reduce the processing cycles for each MB and thus increase the throughput. In addition, the design requires fewer adders and only one Absolute Difference (AD) PE, which drastically reduces the silicon area compared to other existing designs. The pixels of the search region are organized in memory banks such that two sets of 128-bit (16 8-bit pixels) data can be accessed in each clock cycle.

Section 2 gives an overview of FSBM-based motion estimation. Section 3 presents a brief discussion of the SAD modification and describes the proposed FSBM hardware. Implementation and comparative results are presented in Section 4. Section 5 presents a reconfigurable address generator. Finally, Section 6 concludes the paper.

2. FSBM-based Motion Estimation

Motion-compensated video compression models the pixel motion within the current picture as a translation of pixels within a previous picture. The motion vector is obtained by minimizing a cost function that measures the mismatch between the current MB in the current frame and a candidate block in the reference frame. The SAD, the most popular cost function, between the pixels x(i, j) of the current MB and the pixels y(i, j) of the search region can be expressed as

    SAD(u, v) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| x(i, j) - y(i+u, j+v) \right|        (1)

where (u, v) is the displacement between the two blocks. Thus, each search location requires N^2 absolute differences and (N^2 - 1) additions. The FSBMA exhaustively evaluates all possible search locations and hence is optimal in terms of reconstructed video quality and compression ratio. High computational requirements, a regular processing scheme and simple control structures make the hardware implementation of FSBM a preferred choice.

Table 1: Execution profile of a typical video encoder

    SAD      ME/MC    DCT/IDCT   Q/IQ    VLC/VLD   Others
    72.28%   16.85%   6.17%      2.35%   1.45%     0.32%

The execution profile of a standard video encoder, obtained using the GNU gprof tool, is shown in Table 1. The table shows that motion estimation is the most computationally expensive module in a typical video encoder. Within it, SAD computations take the maximum time due to the complex nature of the absolute-value operation and the subsequent multitude of additions.

3. Proposed FSBM Architecture

In this section we present our speed-area optimized FSBM architecture. The first subsection briefly explains the SAD modification and the MB searching technique. The subsequent subsections describe the proposed hardware and the memory organization.

3.1 SAD Modification

This section presents a modification to the SAD computation. The SAD expression in Eq. 1 can be lower-bounded as

    SAD(u, v) \ge \left| \sum_{i=1}^{N} \sum_{j=1}^{N} x(i, j) - \sum_{i=1}^{N} \sum_{j=1}^{N} y(i+u, j+v) \right|        (2)

The detailed proof of this derivation can be found in [12]. Again, it can be posited that if

    \left| \sum_{i=1}^{N} \sum_{j=1}^{N} x(i, j) - \sum_{i=1}^{N} \sum_{j=1}^{N} y(i+u, j+v) \right| \ge SAD_{min},        (3)

then SAD(u, v) \ge SAD_{min} by Eq. 2, where SAD_{min} denotes the current minimum SAD value. Thus, if Eq. 3 is satisfied, the SAD computation at the (u, v)-th location may be skipped. In addition, if X(u, v) is the sum of pixel intensities at the (u, v)-th MB location, then this sum can be derived from X(u-1, v) by subtracting and adding the intensity sums of columns at specific positions. Based on this fact, [12] proposes a search strategy to efficiently derive and compute the MB sums at successive locations. The MB search technique used in our proposed design adopts this particular approach; a software sketch of the resulting search loop is given below.
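To make the search strategy concrete, the following C sketch (illustrative only; the function names, the row-major frame layout with an explicit stride, and the explicit full-SAD fallback are assumptions of the sketch, not details taken from the paper) models the lower bound of Eq. 2 and the incremental column-sum update inside a full-search loop. The hardware of Sections 3.2 and 3.3 realizes the corresponding sum derivation in its pipeline stages and memory banks rather than in software.

    /* Illustrative software model of the modified FSBM search (sketch only).
     * Assumptions: 8-bit luma samples stored row-major with the given stride,
     * N = 16, search range p = 16, and the whole search window lies inside
     * the reference frame. Names are hypothetical, not taken from the paper. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <limits.h>

    #define N 16
    #define P 16

    static int block_sum(const uint8_t *b, int stride)
    {
        int s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += b[i * stride + j];
        return s;
    }

    static int full_sad(const uint8_t *cur, const uint8_t *ref, int stride)
    {
        int sad = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sad += abs(cur[i * stride + j] - ref[i * stride + j]);
        return sad;
    }

    /* Search all (2P+1)^2 displacements; cur points to the current MB and
     * ref to the co-located position in the reference frame. */
    int search_mb(const uint8_t *cur, const uint8_t *ref, int stride,
                  int *best_u, int *best_v)
    {
        int sad_min = INT_MAX;
        int x_sum = block_sum(cur, stride);          /* MB sum, computed once */

        for (int u = -P; u <= P; u++) {
            const uint8_t *row = ref + u * stride - P;
            int y_sum = block_sum(row, stride);      /* full sum only at the leftmost SL */

            for (int v = -P; v <= P; v++) {
                const uint8_t *cand = row + (v + P);
                if (v > -P) {
                    /* Derive the candidate-block sum incrementally (Sec. 3.1):
                     * drop the old leftmost column, add the new rightmost column. */
                    for (int i = 0; i < N; i++)
                        y_sum += cand[i * stride + N - 1] - cand[i * stride - 1];
                }
                /* Eq. 3: skip this location when the lower bound already
                 * reaches the current minimum SAD. */
                if (abs(x_sum - y_sum) >= sad_min)
                    continue;
                int sad = full_sad(cur, cand, stride);
                if (sad < sad_min) {
                    sad_min = sad;
                    *best_u = u;
                    *best_v = v;
                }
            }
        }
        return sad_min;
    }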
3.2 Pipelined SAD Operator

The SAD hardware for the FSBMA is divided into eight independent sequential steps. It computes the initial full SAD for the first Search Location (SL) and derives the SAD sums for subsequent SLs. Fig. 1 shows the data path of the proposed SAD operator for N = 16. Stages 1 to 4 of the proposed design are split to facilitate parallel processing. Each half-stage (from Stage 1 to Stage 4) computes the sum of 16 pixel values per clock cycle. These partial sums are accumulated in the SR and MB registers of Stage 6, which are initialized to 0. For the first SAD calculation, Stage 5 simply passes the intermediate addition result of Stage 4 to Stage 6; this is achieved by setting the S0 control signal of Stage 6 to 0. Thus, the SAD sums of the candidate MB and the first SL can be computed in 6 (for the six stages of the pipeline) + 15 (to add 16 values) = 21 cycles.

Thereafter, for every subsequent SL, the right and the left half-stages add the pixel intensities of the old and the new rows/columns, respectively. At this point, Stage 5 is activated by enabling the S0 control signal. This stage takes the difference of the resultant sums of the two half-stages and accumulates the result in the SR register of Stage 6. Stage 7 computes the AD between the MB sum and the newly obtained SL sum. Finally, Stage 8 compares the new SAD with the existing SAD_min and stores the minimum SAD obtained so far. Thus, at each clock cycle, the proposed pipelined architecture computes one new SAD value and retains the minimum. Hence, with a search range of p = 16, this hardware can search the best match for an MB in only [(2p+1)^2 - 1] + 23 = 1111 clock cycles, as worked out below.
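The per-MB cycle count quoted above is simply the following arithmetic, restated for clarity (the 23-cycle first-SAD latency is the figure used in Section 4.1; it is consistent with the 21-cycle sum above plus one further cycle each for Stages 7 and 8):

    cycles/MB = 23 + \left[(2p+1)^2 - 1\right] = 23 + (33^2 - 1) = 23 + 1088 = 1111 \qquad (p = 16)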

[Figure 1: Data path of the different pipeline stages of the proposed SAD unit.]

3.3 Memory Organization

Our design adopts the MB scanning technique proposed in [12]. The pixels of the p = 16 search region are denoted P_{i,j}, where 1 <= i <= 48 and 1 <= j <= 48 (shown in Fig. 3). This search region has (2p+1)^2 = 33^2 = 1089 search locations.

[Figure 3: Position of the pixels P_{1,1} ... P_{48,48} in the 48x48 search region.]

Initially, the sum of the first search location is computed as \sum_{j=1}^{16} \sum_{i=1}^{16} P_{i,j}. Thereafter, to move the search location left or right, the oldest column of the previous search location is subtracted and one new column of the new search location is added. This implies that, at every clock, we need to access two 128-bit (16 x 8-bit) data words from memory. Each 128-bit word is part of one column of the search region (Fig. 3); e.g., [P_{1,1}, P_{2,1}, P_{3,1}, ..., P_{16,1}] is one such 128-bit word, belonging to column 1 of the search region. It is observed that one of the columns 17 to 32 is always accessed concurrently with a column from the remaining columns, i.e., columns 1 to 16 and 33 to 48, of the pre-defined search region. Therefore, the pixels are organized in two different memory banks, as shown in Fig. 2. The data in these memory banks are organized in column-major format so that a whole column can be fetched by a single memory access. The memory controller generates the appropriate address for both memory banks at every clock. The selected 384 bits (the 48 pixels of a single column of Fig. 3) of each bank are then multiplexed, and the correct 16 pixels are passed on to the SAD processing unit.

When the search location moves down from its previous position, two sets of row pixels must be accessed instead, which the column-organized banks cannot deliver in one clock. It is easily observed from Fig. 3 that either the first 16 pixels or the last 16 pixels of a single row have to be accessed for this purpose: for an even row number the first 16 pixels (P_{i,1}, P_{i,2}, ..., P_{i,16}, i even), and for an odd row number the last 16 pixels (P_{i,33}, P_{i,34}, ..., P_{i,48}, i odd), are accessed to handle the downward movement of the search location. Hence, the required row values are stored in another two memory banks: one of size 32 x 128 bit, to store 32 such row pixel sets, and another of size 16 x 128 bit, to store 16 such row pixel sets. Thus, the design needs only 768 bytes of overhead memory. The organization of these memory banks and the stored pixels is shown in Fig. 2.

[Figure 2: Organization of pixels in (a), (c) column-major and (b), (d) row-major format, i.e., the pixels added or subtracted when the search location shifts left/right or down, respectively. (c) and (d) are the second column/row memory banks, independent of the first column/row memory banks shown in (a) and (b).]

In order to reduce the total number of memory accesses in an FSBM-based architecture, data reuse can be performed at four different levels [14]. Our on-chip memory bank organization adopts the data reuse defined as Level A and Level B. Level A exploits the locality of data within a candidate block strip, where the search locations move within the block strip. Level B exploits the locality among candidate block strips, since vertically adjacent candidate block strips overlap. In our design this memory organization is primarily based on the use of Look-Up Tables (LUTs) in the FPGA implementation. A small software sketch of the bank-selection scheme follows.
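As an illustration of the column-bank split described above, the following C fragment (a sketch under assumed conventions; the bank numbering, word indexing and function names are hypothetical, and the actual controller is RTL, not software) maps a search-region column to a bank and word, and shows why the two column accesses of a horizontal move always fall into different banks.

    /* Sketch of the two column-major bank mapping (assumed model, not the RTL).
     * Columns 17..32 live in bank 1; columns 1..16 and 33..48 live in bank 0,
     * so the "old" and "new" columns of a horizontal move can be read in the
     * same clock, one from each bank. Each stored word holds a full 48-pixel
     * column (384 bits); the SAD unit then selects the 16 pixels it needs. */
    typedef struct {
        int bank;   /* 0: columns 1..16 and 33..48, 1: columns 17..32 */
        int word;   /* word index (address) within that bank          */
    } col_addr_t;

    static col_addr_t map_column(int col /* 1..48 */)
    {
        col_addr_t a;
        if (col >= 17 && col <= 32) {
            a.bank = 1;
            a.word = col - 17;               /* 16 words in bank 1             */
        } else {
            a.bank = 0;
            a.word = (col <= 16) ? col - 1   /* columns 1..16  -> words 0..15  */
                                 : col - 17; /* columns 33..48 -> words 16..31 */
        }
        return a;
    }

    /* A rightward move whose new leftmost column is c (2 <= c <= 33) drops
     * column c-1 and adds column c+15 for a 16-wide MB; these two columns
     * always land in different banks, so both are read in one clock. */
    void columns_for_right_move(int c, col_addr_t *drop, col_addr_t *add)
    {
        *drop = map_column(c - 1);
        *add  = map_column(c + 15);
    }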
4. Performance Analysis

This section presents the implementation results of the proposed hardware and then compares them with other existing FPGA-based designs.

4.1 Implementation Results

The proposed design has been implemented in Verilog HDL and verified with RTL simulations using Mentor Graphics ModelSim SE. The Verilog RTL has been synthesized for a Xilinx Virtex-4 4vlx100ff1513 FPGA. The synthesis results show that the design requires 333 CLB slices, 416 DFFs/latches and a total of 278 input/output pins. The implementation occupies 380 look-up tables (LUTs), and the highest achievable frequency is 221.322 MHz.

The pipelined design takes 23 clock cycles to produce the first SAD value; thereafter, one SAD value is generated every cycle. A search range of p = 16 has (2p+1)^2 = 1089 search locations, so the number of cycles required by our hardware to find the best matching block is 23 (for the first search location) + 1088 (for the remaining search locations) = 1111 cycles. Our FPGA implementation works at a maximum frequency of 221.322 MHz (4.52 ns clock cycle). Hence, it can process one MB (16x16) in 5.022 usec (1111 clock cycles per MB x 4.52 ns per clock cycle) and a 720p HDTV (1280x720) frame in 18.078 msec (3600 MBs per frame x 5.022 usec per MB). At this speed, the proposed hardware can process 55.33 720p HDTV frames per second. This is a big improvement over other approaches, whose frame rates are much lower, as is evident from Table 2. The high speed and throughput of our design stem mainly from the modified SAD operation and the split-pipeline design of the proposed architecture.

4.2 Performance Comparison

This subsection compares the hardware features and performance of the proposed design with existing FPGA architectures. No comparison has been made with available ASIC solutions.
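For reference, the throughput-related entries for the proposed design in Table 2 below follow directly from these figures (arithmetic restated, not additional measurements):

    T = \frac{221.322 \times 10^{6}\ \text{cycles/s}}{1111\ \text{cycles/MB}} \approx 199{,}209.7\ \text{MBs/s}, \qquad
    \text{fps}_{720p} = \frac{T}{3600} \approx 55.3, \qquad
    \frac{T}{\text{Area}} = \frac{T}{333\ \text{slices}} \approx 598.2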

Table 2: Comparison of hardware features and performance with N = 16 and p = 16

Feature-based comparison:

    Design (device)                               Cycles/MB   Freq (MHz)   CLB slices   Input ports   AD PEs   Adders                                            Comparators
    Loukil et al. [7] (Altera Stratix)            8482        103.8        1654         48            33       33 8-bit                                          17
    Mohammadzadeh et al. [8] (Xilinx Virtex II)   25344       191.0        300          2             33       33 8-bit                                          34
    Olivares et al. [9] (Xilinx Spartan)          27481       366.8        2296         2             256      510 1-bit                                         1
    Roma et al. [10] (Xilinx XCV3200E)            2800        76.1         29430        3             256      15 8-bit                                          1
    Ryszko et al. [11] (Xilinx XC40250)           1584        30.0         948          16            256      16 8-bit                                          1
    Wong et al. [17] (Altera Flex20KE)            45738       197.0        1699         32            16       243 8-bit                                         1
    Ours (Xilinx Virtex-4)                        1111        221.322      333          256           1        16 8-bit, 8 9-bit, 4 10-bit, 3 11-bit & 2 16-bit   1

Performance:

    Design                      HDTV 720p (fps)   Throughput T (MBs/sec)   T/Area
    Loukil et al. [7]           3.4               12237.7                  7.4
    Mohammadzadeh et al. [8]    2.09              7536.3                   25.1
    Olivares et al. [9]         3.71              13347.4                  5.8
    Roma et al. [10]            7.55              27178.6                  0.92
    Ryszko et al. [11]          5.26              18939.4                  11.9
    Wong et al. [17]            1.2               4307.1                   2.5
    Ours                        55.33             199209.7                 598.2

Table 2 compares the hardware features of the proposed and existing FPGA solutions for a macroblock (MB) of size 16x16 and a search range of p = 16. As can be seen, our design consumes the fewest cycles per MB and operates at a very high maximum frequency; the splitting of the initial stages of the pipeline facilitates this speed. The area required in terms of CLB slices and the hardware complexity in terms of AD PEs (Absolute Difference processing elements), adders and comparators are also much lower for the proposed architecture. The modification of the SAD operation contributes to the high speed, small area and low hardware complexity. The use of memory banks provides higher on-chip bandwidth; however, it also leads to the only drawback of our design, namely the high number of input/output pins.

A performance comparison of the various architectures is also shown in Table 2. To compare the speed-area optimized performance of the different architectures, the criterion of throughput/area has been used: the higher the throughput/area of a design, the better its speed-area optimization. The architectures are compared in terms of (a) the number of HDTV 720p (1280x720) frames that can be processed per second, (b) the throughput, i.e., MBs processed per second, (c) the throughput/area, and (d) the I/O bandwidth. As can be seen, the proposed design has a very high throughput and can process the largest number of HDTV 720p frames per second (fps). Moreover, the superior speed-area optimization of the proposed design is exhibited by its substantially higher throughput/area value of 598.2.

5. Reconfigurable Block Matching Hardware

Apart from the full pixel pattern, block matching can also be performed using N-queen decimation patterns. It has been shown in [16] that the N-queen patterns incur a similar PSNR drop but yield much faster encoding than the full pattern, particularly for N = 4 and N = 8. This section presents a reconfigurable hardware design that finds the minimum SAD value using any one of the full-search, 8-queen or 4-queen decimation techniques. To the best of our knowledge, no similar hardware design exists in the literature. For both 4-queen and 8-queen decimation, the pixels processed in two consecutive SAD-based block matches are mutually independent; this fact can be utilized to further enhance the performance of the SAD operator discussed in Section 3. Only the memory organization and the per-clock address generation differ between the three decimation patterns. It has been observed that the reconfigurable address generator and the SAD operator require only 40% and 2% extra hardware cost, respectively, compared to the already proposed full-pixel architecture. An illustrative sketch of an N-queen decimation pattern is given below.
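To make the decimation concrete, the following C sketch builds an 8-queen sampling pattern over a 16x16 MB and computes the decimated SAD over the 32 selected pixels. This is a software illustration only: the particular queen placement below is just one valid example and is not taken from [16] or from the proposed hardware, which instead evaluates these pixels through the reconfigured pipeline of Fig. 1.

    /* Sketch of N-queen pattern decimated SAD (illustrative; the specific
     * queen placement used by [16] and by the hardware may differ). */
    #include <stdint.h>
    #include <stdlib.h>

    #define MB 16

    /* One valid 8-queens placement: queen_col[r] is the sampled column in row r
     * of each 8x8 tile; one sample per row and per column, none sharing a diagonal. */
    static const int queen_col[8] = { 0, 4, 7, 5, 2, 6, 1, 3 };

    /* Decimated SAD over a 16x16 MB: the 8-queen pattern is tiled over the four
     * 8x8 sub-blocks, so only 4 * 8 = 32 of the 256 pixels enter the sum. */
    int sad_8queen(const uint8_t *cur, const uint8_t *ref, int stride)
    {
        int sad = 0;
        for (int ty = 0; ty < MB; ty += 8)          /* tile rows               */
            for (int tx = 0; tx < MB; tx += 8)      /* tile columns            */
                for (int r = 0; r < 8; r++) {       /* one pixel per tile row  */
                    int i = ty + r;
                    int j = tx + queen_col[r];
                    sad += abs(cur[i * stride + j] - ref[i * stride + j]);
                }
        return sad;   /* 32 absolute differences instead of 256 */
    }

A 4-queen pattern works analogously with 4x4 tiles, selecting 16 x 4 = 64 pixels per SAD, matching the pixel counts given in the following paragraphs.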

The reconfigurable address generator uses a common datapath. Two consecutive addresses are represented by their respective bit-value differences, and for each decimation technique the bits are toggled following predefined patterns. Bit toggling of the 8-bit address lines is controlled by enable signals that are generated by a dedicated controller logic. This state-machine-based controller generates the respective enable signals depending on a 2-bit decimation-mode-select input signal.

The pipelined datapath shown in Fig. 1 can also be reconfigured according to the user-specified decimation mode. In the case of 8-queen decimation on a 16x16 block, 32 pixel values are added every clock by both halves of pipe stages one to five; the resultant value is used directly to compute the absolute difference with the MB sum and hence the current SAD value. The same datapath of the pipelined SAD operator also performs the SAD calculation for 4-queen decimation. This technique requires 64 pixels for each SAD value on a 16x16 block, so the pipeline is reconfigured such that both halves of stages one to five, together with stage six, perform the addition of these 64 pixel values. Subsequently, it computes the sum of absolute differences to obtain the new SAD.

6. Conclusions

This paper has presented an FPGA-based design for the Full Search Block Matching Algorithm. The novelty of the design lies in its modified SAD calculation and in the split-pipeline structure that allows parallel processing in the initial stages of the hardware. The macroblock search scan has also been suitably altered to facilitate the derivation of SAD sums from previously computed results. Compared to existing FPGA architectures, the proposed design exhibits superior performance in terms of high throughput and low hardware complexity. The high frame processing rate of 55.33 fps makes this design particularly useful in both frame and field processing for HDTV-based applications. Finally, the paper has outlined a reconfigurable block matching hardware that could be useful in a general-purpose real-time video processing unit.

References

[1] V. Do and K. Yun. A low-power VLSI architecture for full-search block-matching. IEEE Trans. Circuits and Systems for Video Technology, 8(4):393-398, Aug. 1998.
[2] C. Hsieh and T. Lin. VLSI architecture for block-matching motion estimation algorithm. IEEE Trans. Circuits and Systems for Video Technology, 2(2):169-175, June 1992.
[3] Y. Jehng, L. Chen, and T. Chiueh. Efficient and simple VLSI tree architecture for motion estimation algorithms. IEEE Trans. Signal Processing, 41(2):889-899, Feb. 1993.
[4] T. Komarek and P. Pirsch. Array architectures for block matching algorithms. IEEE Trans. Circuits and Systems, 36(10):1301-1308, Oct. 1989.
[5] Y. Lai and L. Chen. A data-interlacing architecture with two-dimensional data-reuse for full-search block-matching algorithm. IEEE Trans. Circuits and Systems for Video Technology, 8(2):124-127, April 1998.
[6] S. Lin, P. Tseng, and L. Chen. Low-power parallel tree architecture for full search block-matching motion estimation. In Proc. Intl. Symp. on Circuits and Systems, volume 2, pages 313-316, May 2004.
[7] H. Loukil, F. Ghozzi, A. Samet, M. Ben Ayed, and N. Masmoudi. Hardware implementation of block matching algorithm with FPGA technology. In Proc. Intl. Conf. on Microelectronics, pages 542-546, Dec. 2004.
[8] M. Mohammadzadeh, M. Eshghi, and M. Azadfar. Parameterizable implementation of full search block matching algorithm using FPGA for real-time applications. In Proc. 5th IEEE Intl. Caracas Conf. on Devices, Circuits and Systems, Dominican Republic, pages 200-203, Nov. 2004.
[9] J. Olivares, J. Hormigo, J. Villalba, I. Benavides, and E. Zapata. SAD computation based on online arithmetic for motion estimation. Microprocessors and Microsystems, 30:250-258, Jan. 2006.
[10] N. Roma, T. Dias, and L. Sousa. Customisable core-based architectures for real-time motion estimation on FPGAs. In Proc. 3rd Intl. Conf. on Field Programmable Logic and Applications, pages 745-754, Sep. 2003.
[11] A. Ryszko and K. Wiatr. An assessment of FPGA suitability for implementation of real-time motion estimation. In Proc. IEEE Euromicro Symp. on Digital System Design (DSD), pages 364-367, 2001.
[12] A. Saha and S. Ghosh. A speed-area optimization of full search block matching with applications in high-definition TVs (HDTV). In LNCS Proc. of High Performance Computing (HiPC), Dec. 2007 (to appear).
[13] L. Sousa and N. Roma. Low-power array architectures for motion estimation. In IEEE 3rd Workshop on Multimedia Signal Processing, pages 679-684, 1999.
[14] J. Tuan and C. Jen. An architecture of full-search block matching for minimum memory bandwidth requirement. In Proc. IEEE GLSVLSI, pages 152-156, Feb. 1998.
[15] L. Vos and M. Stegherr. Parameterizable VLSI architectures for the full-search block-matching algorithm. IEEE Trans. Circuits and Systems, 36(10):1309-1316, Oct. 1989.
[16] C. Wang, S. Yang, C. Liu, and T. Chiang. A hierarchical N-queen decimation lattice and hardware architecture for motion estimation. IEEE Trans. Circuits and Systems for Video Technology, 14(4):429-440, April 2004.
[17] S. Wong, S. Vassiliadis, and S. Cotofana. A sum of absolute differences implementation in FPGA hardware. In Proc. 28th Euromicro Conf., pages 183-188, Sep. 2002.
[18] K. Yang, M. Sun, and L. Wu. A family of VLSI designs for the motion compensation block-matching algorithm. IEEE Trans. Circuits and Systems, 36(10):1317-1325, Oct. 1989.
[19] Y. Yeh and C. Lee. Cost-effective VLSI architectures and buffer size optimization for full-search block matching algorithms. IEEE Trans. VLSI Systems, 7(3):345-358, Sep. 1999.
[20] H. Yeo and Y. Hu. A novel modular systolic array architecture for full-search block-matching motion estimation. In Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, volume 5, pages 3303-3306, 1995.