An Advanced Hierarchical Motion Estimation Scheme with Lossless Frame Recompression and Early Level Termination for Beyond High Definition Video Coding

Xuena Bao, Dajiang Zhou, Peilin Liu, and Satoshi Goto, Fellow, IEEE

Abstract—In this paper, we present a hardware-efficient fast algorithm with a lossless frame recompression scheme and an early level termination strategy for large search range (SR) motion estimation (ME) in beyond high definition video encoders. To achieve high ME quality for hierarchical motion search, we propose an advanced hierarchical ME scheme that processes the multi-resolution motion search with an efficient refining stage. This enables high data and hardware reuse for much lower bandwidth and memory cost, while achieving higher ME quality than previous works. In addition, a lossless frame recompression scheme based on this ME algorithm is presented to further reduce bandwidth. A hierarchical memory organization, together with a leveling two-step data fetching strategy, is applied to meet the random-access constraint of the hierarchical motion search structure, and a leveling compression strategy that allows a lower level to refer to a higher one for compression is proposed to efficiently reduce bandwidth. Furthermore, an early level termination method suited to the hierarchical ME structure is applied. This method terminates redundant high-level motion searches by establishing thresholds based on the current block mode and motion search level; it also applies early refinement termination in order to avoid unnecessary refinement for the high levels. Experimental results show that the total scheme yields a much smaller bit rate increase than previous works, especially for high-motion sequences, while achieving considerable savings in memory and bandwidth cost for a large SR of [-128, 127]. Index Terms—Beyond high definition, early level termination, hierarchical motion estimation, lossless frame recompression, video coding.
Manuscript received July 6th. This research was supported by the Waseda University Ambient SoC Global COE Program of MEXT, Japan, by the Knowledge Cluster Initiative (2nd Stage) of MEXT, Japan, and by CREST of the Japan Science and Technology Agency. Xuena Bao was with the Graduate School of Information, Production and Systems, Waseda University, 2-7 Hibikino, Kitakyushu, Japan. She is now with the Department of Electronic Engineering, Shanghai Jiao Tong University, No.800 Dong Chuan Road, Shanghai, China (e-mail: baoxuena@sjtu.edu.cn). Dajiang Zhou is with the Graduate School of Information, Production and Systems, Waseda University, 2-7 Hibikino, Kitakyushu, Japan (e-mail: zhou@fuji.waseda.jp). Peilin Liu is with the Department of Electronic Engineering, Shanghai Jiao Tong University, No.800 Dong Chuan Road, Shanghai, China (e-mail: liupeilin@sjtu.edu.cn). Satoshi Goto is with the Graduate School of Information, Production and Systems, Waseda University, 2-7 Hibikino, Kitakyushu, Japan (e-mail: goto@waseda.jp).

I. INTRODUCTION

In order to provide high-quality perception, TV resolution has grown dramatically, and beyond high definition (beyond HD) videos such as QFHD (quad full high definition, 3840x2160/2160p) and SHV (Super Hi-Vision, 7680x4320/4320p) sequences are becoming a trend for real applications. Although the H.264/AVC video coding standard, which provides good coding efficiency, has been widely adopted in many video devices, most of them only support high definition (HD) video or below because of the high computational complexity of the ME part. For sequences beyond high definition, motions among neighboring frames are larger than in lower-definition sequences with the same visual content ([1]); therefore, the required search range has to be up to [-128, 127] or even larger in order to capture the motions accurately.
For these beyond HD sized applications with large SR, the huge resource consumption of previous ME approaches becomes the bottleneck of encoder chip design. First and foremost, since the reference data are stored in dynamic random access memory (DRAM) and have to be accessed during the ME process, the DRAM bandwidth requirement becomes very large for encoding beyond HD sequences, exceeding the limits of current DDR2 and DDR3 techniques. Besides, huge DRAM traffic also means a significant share of the total system power. In addition, the area cost as well as the chip pin count grows dramatically, which leads to high chip design cost. Furthermore, previous approaches consume too many computational cycles, which makes them unsuitable for real-time encoder systems. Recently, some motion estimation algorithms such as [2, 3] have been proposed, whose estimation ranges target HD sequences or below. [2] proposes an update-type motion estimation scheme with a multi-resolution approach for motion compensated image interpolation. [3] presents a fast modified diamond search algorithm for motion estimation. In addition, some video coding strategies aiming to support high definition sequences have also been proposed ([4, 5, 6]). [4, 5] propose intra prediction schemes suitable for high definition videos, while [6] presents a memory interface architecture for high definition video coding.

In order to ensure performance for beyond high definition cases, [1, 7] propose strategies suitable for beyond high definition video coding. However, the motion estimation strategy in [1] is full search with an SR of 128, while [7] proposes a set of diagonal partition shapes for variable motion partitioning. Both algorithms consume considerable calculation time and bandwidth, and are not suitable for direct hardware implementation. Among the fast ME algorithms proposed to support large search ranges with hardware implementations, there are two promising ME structures: cache-based ME (CBME) in [8], integrated in a quad HDTV sized video encoder ([9]), and parallel multi-resolution ME (PMRME) in [10], integrated in [11]. In CBME, a [-16, +15] refinement is done around the best motion vector selected from several predicting candidates. Although it performs well for small and uniform motions, the coding efficiency loss becomes rather large for high motions due to the limited motion prediction range. PMRME uses three independent levels: two sub-sampling levels that cover large ranges and one fine level that covers a small range. This architecture solves the dependency problem with significant savings of on-chip memory and bandwidth. However, if the motion falls in the area of the sub-sampling levels, the motion vector (MV) from the coarse search is passed directly to the next stage without refinement, leading to non-negligible quality loss. In addition, although large bandwidth savings can be achieved in the integer ME part, the data in the sub-sampling levels cannot be reused in fraction ME, which costs extra bandwidth when reference data without sub-sampling have to be fetched for the fraction motion search. To solve the above problems, an advanced parallel multi-resolution ME algorithm with a coarse-to-fine strategy is proposed in this paper.
In order to avoid the quality loss of CBME for high motion sequences, the proposed algorithm applies a hierarchical motion estimation strategy that integrates large-area sub-sampling searches. Then, unlike PMRME, which only searches sub-sampled values in the high levels, the proposed algorithm applies a coarse search based on the average value of each sub-sampled block, after which the motion vector with minimum cost is refined inside the block. Since the refinement can reuse the low-level hardware, and the original (non-sub-sampled) reference data used in the refining stage can be reused in fraction ME, the area and bandwidth costs do not increase, and the latency problem can be solved by a suitable pipeline strategy, while the ME quality is much better. Furthermore, frame recompression is applied to reduce bandwidth. There are several existing works on this technique, such as [12, 13, 14, 15, 16, 17]. However, all of them divide the reference picture into small blocks and compress each block under the general assumption that all pixels of the block will be needed for motion search, which is not suitable for the hierarchical ME application. Therefore, we propose a lossless frame recompression scheme based on the proposed hierarchical ME algorithm. A hierarchical memory organization is proposed to meet the random-access constraint of the hierarchical ME search structure. Then a leveling two-step data fetching strategy, which fetches the needed data separately from the three levels, is applied to extend random access to a lower level without causing latency problems. Furthermore, in order to compress the reference data in each level, the differential values are calculated by referring the current values to the average values stored in the higher level, and an efficient coding method suitable for compressing these differential values is applied.
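The leveling compression idea just described, coding each pixel as a difference from the average stored one level higher, can be sketched as follows (a minimal illustration; the function names are ours, not the paper's):

```python
def block_diffs(pixels, avg):
    # level-0/level-1 pixels are coded as differences from the average of
    # the 2x2 block they belong to; that average is already stored in the
    # next-higher level, so only the differences need to be entropy-coded
    return [v - avg for v in pixels]

def reconstruct(diffs, avg):
    # decompression re-adds the average fetched from the higher level
    return [d + avg for d in diffs]
```

Because the averages are stored anyway for the coarse search, the differences are typically small and therefore cheap to code.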
In order to terminate unnecessary large-area searches for small motions, an early level termination method based on the hierarchical ME algorithm is applied. This method chooses different thresholds for different motion search levels based on the motion cost and the estimated motion search improvement, and is thus able to terminate redundant high-level searches according to these thresholds. Furthermore, early refinement termination is also applied to avoid redundant refining searches in the high levels, which saves the extra bandwidth cost of unnecessary refinement. Experimental results show that the proposed early level termination strategy can effectively reduce bandwidth with little quality degradation. In section 2, we introduce the hierarchical ME architecture, while the proposed frame recompression scheme is explained in section 3. The proposed early level termination strategy is presented in section 4. The experimental results are given in section 5. Finally, conclusions are drawn in section 6.

II. PROPOSED HIERARCHICAL ME SCHEME

A. The Basic Idea of Proposed Algorithm

The proposed parallel multi-resolution ME algorithm includes three levels: one fine level and two sub-sampling levels. The fine level, without data sub-sampling, covers the search range for small motions. The other two levels, with data sub-sampling, cover the large search range to find motion vectors for large motions. After the motion search, the results of the three levels are compared and the motion vector with minimum cost is chosen as the final ME result. This hierarchical searching strategy ensures a large prediction range for high motions, and so achieves better performance than previous strategies that only process small-area searches (such as CBME).
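The three-level structure and the final minimum-cost selection can be sketched as follows (an illustration under our own naming; `search_level` stands in for whatever per-level search is performed, and each call is assumed to return a `(mv, cost)` pair):

```python
def hierarchical_me(search_level):
    # three parallel searches with the paper's search ranges
    results = [
        search_level(level=0, sr=(-7, 6)),      # fine search on original data
        search_level(level=1, sr=(-32, 31)),    # 4:1 sub-sampled coarse search
        search_level(level=2, sr=(-128, 127)),  # 16:1 sub-sampled coarse search
    ]
    # compare the three candidates and keep the one with minimum motion cost
    return min(results, key=lambda r: r[1])
```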
In the sub-sampling levels of hierarchical ME, one big problem is that ME quality loss is inevitable compared with a large-area full search if only sub-sampled pixels are searched, as in the PMRME searching method proposed in [10]. In order to reduce the quality loss caused by the sub-sampling search, after the coarse search based on the average value of each block, the motion vector with minimum motion cost is further refined to find the best matched position. This refining strategy results in better coding performance than PMRME. In addition, since the original reference data used for the refinement can be reused in fraction ME, the extra bandwidth consumption of [10] for fetching the original data based on the sub-sampled MV during the fraction ME process can also be avoided.

Fig. 1. Hierarchical ME algorithm. [Level 0: SR=[-7,+6], centered on the predicted MV, fine search on original data; Level 1: SR=[-32,+31], centered on (0,0); Level 2: SR=[-128,+127], centered on (0,0); levels 1 and 2 perform a coarse search on average values followed by a fine search for the best MV.]

TABLE I
MODES SUPPORTED FOR DIFFERENT LEVELS

Level   Block Size
0       16x16, 16x8, 8x16, 8x8
1       16x16, 16x8, 8x16, 8x8
2       16x16

B. The Three Levels of Proposed Algorithm

The proposed hierarchical ME algorithm is illustrated in Fig. 1. In the lowest level, level 0, the SR is set to [-7, +6]. Since the predictive motion vector (PMV) has a relatively high probability of being the final MV in small-motion search, it is chosen as the search center. This level performs a fine search based on original reference data without sub-sampling. In addition, all variable block size modes are enabled (Table I). In order to save bandwidth while maintaining coding efficiency, we only support block sizes larger than or equal to 8x8, as in [7, 18]. According to [18], this approach maintains the coding performance in most cases, especially when RDO (rate distortion optimization) is off. In level 1, the SR is enlarged to [-32, +31]. It is centered on the current block position in the reference picture, which is defined as the original point (0, 0). This enables regular memory reuse between successive MB processing, as illustrated in [10]. In this level, 4:1 sampling is applied: only the average value of each 2x2 block from level 0 is searched, by comparing it to the average value of the corresponding current block, and all modes from 16x16 down to 8x8 are enabled. After the coarse search, the MV with minimum cost, which points to the center of a 2x2 block, is refined inside that block. The refinement calculates the motion cost based on the original reference data without sub-sampling. In level 2, the SR is the largest, [-128, +127], and is also centered on (0, 0).
In this level, the average value of each 2x2 block from level 1 (which is also the average value of each 4x4 block from level 0) is searched, and only the 16x16 mode is enabled, since other modes would contain too few average values for SAD calculation. The coarse MV is then refined inside the 4x4 block. Among the three parallel levels, level 2 provides a large search range for relatively high motions. According to [1], a higher definition sequence is generally more homogeneous than a lower definition sequence with the same video content. Therefore, the average values of the 16:1 sampling are used for SAD calculation in order to predict the MV without causing much quality loss, and the refinement then locates the MV down to integer-pixel precision. Similarly, level 1 provides finer precision for medium motions, while most small motion vectors fall in level 0.

C. The Calculation Structure of Proposed Algorithm

The ME calculation scheme is shown in Fig. 2. Each calculation component is decomposed into combinations of primitive calculation modules, one per 4x4 block. Each level comprises exactly 16 primitive modules to balance the area costs of the three levels, as illustrated in [10]. Furthermore, in the proposed ME structure with its coarse-to-fine strategy, the level 0 hardware is also used for the refining search of the coarse MVs resulting from the high levels. Therefore, after the high levels' coarse searches, the motion vectors with minimum costs are transmitted to the level 0 calculation module for further refinement. Since the refinement for the high levels follows a similar process to the level 0 search, the level 0 hardware can be directly reused for the refinement process.
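The 4:1/16:1 averaging and the refinement inside the winning block can be sketched as follows (illustrative only; integer averaging and the cost callback are our assumptions, not the paper's hardware arithmetic):

```python
def downsample_avg(frame, f):
    # average each f x f block: f=2 on level 0 gives the level-1 samples,
    # and f=2 applied again (or f=4 on level 0) gives the level-2 samples
    h, w = len(frame), len(frame[0])
    return [[sum(frame[y * f + dy][x * f + dx]
                 for dy in range(f) for dx in range(f)) // (f * f)
             for x in range(w // f)]
            for y in range(h // f)]

def refine(block_origin, block_size, cost):
    # after the coarse search, test every integer position inside the
    # winning 2x2 (level 1) or 4x4 (level 2) block against original data
    bx, by = block_origin
    candidates = [(bx + dx, by + dy)
                  for dy in range(block_size) for dx in range(block_size)]
    return min(candidates, key=cost)
```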
Fig. 2. Hierarchical ME structure. [The current MB and three reference buffers (level 0: original values; level 1: average values of each 2x2 block; level 2: average values of each 4x4 block) feed 16 SAD modules per level; level 0 SAD module 0 also performs the refining search for levels 1 and 2; 8x8 SAD trees and minimum selection produce the level 0/1/2 MVs and the best MV for each mode.]

Finally, the level 0 search result, as well as the refining

MV results for the two high levels, are compared, and the motion vector with minimum motion cost is chosen as the final ME result. In order to balance the calculation time of the three levels' calculation modules, the SR of level 0 is adjusted to [-7, +6]. Since the level 0 calculation module contains 16 primitive modules in total, it can calculate the SAD values of the variable sub-blocks inside the MB for one search point in one clock cycle. As a result, the level 0 search takes 196 cycles for an SR of [-7, +6] (14x14 search points). For the refining searches, the search centers of the sub-blocks inside the MB (for modes below 16x16) may differ, since they result from the high-level searches for the sub-block modes. As a result, the sub-blocks cannot be searched at the same time when calculating the SAD costs inside the MB. Hence the refining searches for the high levels cost at most 52 cycles: an SR of [-1, 0] in level 1 (4 cycles per block for 9 blocks: one 16x16, two 8x16, two 16x8 and four 8x8) and an SR of [-2, +1] in level 2 (16 cycles for one 16x16 block). As a result, the total calculation time of the level 0 calculation module including the refining search (248 cycles) almost equals the 256 cycles of the other two levels, which balances the calculation cycles of the three levels. The total calculation process of the three levels can be synchronized and pipelined by delaying the level 0 search of the first MB by 52 cycles for initialization, and this delay is negligible compared with the total motion estimation time for one frame. Therefore, the calculation cycles for each MB do not increase compared with the scheme proposed in [10].

III. FRAME RECOMPRESSION SCHEME

A. Variable Compression Ratio Based Lossless Frame Recompression Scheme

Frame recompression (FRC) is a technique that compresses the data before storing them into the frame memory and decompresses the data fetched back.
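As a generic illustration of that lossless contract (a toy run-length codec, not the codec proposed in this paper), any lossless FRC pair must satisfy decompress(compress(x)) == x:

```python
def rle_compress(samples):
    # toy lossless codec: collapse runs into [value, run-length] pairs
    out = []
    for v in samples:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

def rle_decompress(pairs):
    # expand the pairs back to the original sample sequence
    return [v for v, n in pairs for _ in range(n)]
```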
The proposed frame recompression scheme applies a variable-compression-ratio strategy, which divides the reference picture into small blocks and compresses each block with any ratio. This strategy differs from the fixed-compression-ratio model that compresses each block into the same size ([14, 15, 16, 17]), and it avoids the shortcomings of the fixed-ratio strategy. On one hand, blocks with higher compression potential can be compressed with a relatively high compression ratio. On the other hand, blocks with lower compression potential do not have to be fitted into a designated compression ratio, so no quality loss occurs. In addition, the consequent drift error, i.e., error propagation due to the quality loss of reference frames, is also avoided. There are two published works based on variable compression ratio ([12, 13]). In their structure, the uncompressed reference frame is divided into groups, and each group is further divided into partitions. The compression of the frame is processed per partition. In order to support a variable compression rate, each partition can be compressed with any ratio, and the compressed partitions are stored compactly in their original groups, as shown in Fig. 3. When compressing each partition, its length is also recorded into DRAM, which can be used to derive the offset of the compressed partition inside the group. As a result, a two-step data fetching strategy, which fetches the length information and then the compressed partition, can be used to fetch a compressed partition.

Fig. 4. DPCM scanning modes. [Two scanning orders, <mode 0> and <mode 1>, over the partition samples.]
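The offset derivation from the recorded lengths described above (S0 = 0, Si = Si-1 + Li-1, cf. Fig. 3) can be sketched as:

```python
def partition_offsets(lengths):
    # start address of each compressed partition inside its group, in AU,
    # accumulated from the recorded lengths of the preceding partitions
    starts, s = [], 0
    for length in lengths:
        starts.append(s)
        s += length
    return starts
```

With the Fig. 3 example lengths of 0.75, 1.5, 1.5 and 0.75 AU, this yields start addresses 0, 0.75, 2.25 and 3.75 AU.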
To compress each partition, various scanning modes are used to calculate DPCM (Differential Pulse Code Modulation) values (as shown in Fig. 4), and then variable length coding (VLC) is applied to these values to express them in fewer bits. The mode with the shortest bit length is chosen as the final scanning order, and the coded bits are stored into DRAM.

B. Hierarchical ME Based Frame Recompression Scheme

The proposed frame recompression scheme is based on the hierarchical ME scheme, and thus applies a hierarchical compression strategy. Although the variable-compression-ratio frame recompression scheme proposed in [12] and [13] achieves considerable bandwidth reduction, it is based on a single-level motion estimation structure, which compresses and fetches all pixels of a block, so it is not suitable for integration into the hierarchical ME scheme. The proposed scheme applies the hierarchical compression strategy by allowing a lower level to refer to a higher one for compression, in order to minimize the total information to be stored. During the hierarchical ME process, the compressed data are fetched by applying the leveling data fetching strategy and decompressed by referring to the high-level average values.

Fig. 3. Memory organization for the reference picture; the compressed reference picture; and calculating the start address of a partition in a group (e.g., with lengths L0 = 0.75, L1 = 1.5, L2 = 1.5, L3 = 0.75 AU, the start addresses are S0 = 0, S1 = S0+L0, S2 = S1+L1, S3 = S2+L2). P# stands for partition #, L# for its length, and S# for its start address.

C. Hierarchical Memory Organization

In the proposed scheme, the DRAM bus width is set to 64 bits, which is taken as the access unit (AU). An uncompressed reference frame is divided into groups of 16x16 samples, and each group is further divided into 4 partitions of 8x8 samples. According to the hierarchical ME structure, the reference average values are stored in the high-level memories: level 0 stores all pixels of each partition, level 1 stores the average value of each 2x2 block in level 0, and level 2 stores the average value of each 2x2 block in level 1 (which is also the average value of each 4x4 block in level 0). These 5/16 additional average values are stored and used to compress the data of the lower level in the recompression part, so the total information to be stored after applying the compression strategy does not increase in comparison with previous works. The hierarchical memory organization is shown in Fig. 5. The compression of the frame is processed per partition in each level (except level 2, which is not compressed). After the frame is compressed, the compressed partitions of each level (partitions of level 2 contain original data) are stored compactly in their original groups, as shown in Fig. 6. This structure allows random location of a compressed partition down to the group level, with unused space not adding to the DRAM data transfer.

D. Leveling Two-Step Data Fetching Strategy

In order to fetch a compressed partition, the leveling two-step data fetching strategy is applied in this work. After each partition is compressed in level 0 and level 1, the length as well as the content of the compressed partition is recorded and stored into DRAM.
As a result, when a compressed partition in a level needs to be accessed, the two fetching steps, for the length information and for the compressed partition, can be processed in a pipeline to solve the latency problem. The start address of a partition can be obtained by accumulating the lengths of its preceding partitions in the same group, as shown in Fig. 6. Although the access to the length information may cause an extra bandwidth requirement, taking the 8 bpp (bits per pixel) picture format as an example, only 9 bits for level 0 and 7 bits for level 1 are needed to record the length of each compressed partition, compared with the original 512 bits (level 0) and 128 bits (level 1) of data. The analysis can be extended to other formats, and for higher pixel depths the ratio of the length bits to the original data becomes even lower. So the overhead is nearly negligible in comparison with the bandwidth saved by compression. In addition, the overhead caused by the length table in each level can be further reduced once the partition lengths of one group are buffered in a cache.

Fig. 5. Hierarchical memory organization for the reference picture. [Level 0 (full resolution, 1), level 1 (1/4) and level 2 (1/16); groups of 16x16 samples divided into four 8x8 partitions.]

E. Differential Value Calculation and Variable Length Coding

Since level 2 only includes 1/16 of the data of a partition, compressing it would contribute too little to the bandwidth reduction, so this level's data are not compressed. To compress each partition in level 0 and level 1, the difference value of each pixel is calculated by comparing it to the average value of the 2x2 block it belongs to, and then VLC is applied to these differential values to express them in fewer bits. In most cases, the pixel values of each block are normally distributed around the average value of the block.
Therefore, the compression ratio will be relatively high when the average value is subtracted from the current pixel and the differential value is encoded. Since the average values of the current level are stored in its higher level, it is not difficult to fetch the corresponding reference average values from the higher level when decompressing the data, without extra calculation time. To encode the differential values, a new method is chosen in this paper. Each partition is divided into 2x2 blocks (16 blocks for level 0 and 4 for level 1). As shown in Table II, the category of each 2x2 block is selected according to the block's maximum absolute value, and the category indicator is encoded with variable length coding according to its popularity. Then, each value inside the block is encoded using the corresponding method. From the table, we can see that the length of a coded value inside one block has only two possibilities, which leads to a decoding algorithm with much less dependency in comparison with conventional variable length coding methods. If the maximum absolute value of one block is greater than 20, the block is expressed with the original 8-bit samples. Finally, the category indicators and coded differential values of a partition are stored into DRAM.

Fig. 6. Hierarchical memory organization for the compressed reference picture; calculating the start address of a partition in a group (e.g., with lengths L0 = 2.25, L1 = 1.5, L2 = 1.25 AU, the start addresses are S0 = 0, S1 = S0+L0, S2 = S1+L1). P# stands for partition #, L# for its length, and S# for its start address.

However, if the total length of the compressed partition is

greater than the original length, the original samples are directly stored into DRAM without compression.

TABLE II
VARIABLE LENGTH CODING FOR DIFFERENTIAL VALUES. S STANDS FOR THE SIGN BIT OF THE DIFFERENTIAL VALUE

Category: A, B, C, D, E, F (indicators coded with VLC by popularity)
Max. Abs. ±1: 0S 1S 00S 001S 0001S
Max. Abs. ±2: 00S 01S 010S 0010S
Max. Abs. ±3: 10S 011S 0011S
Max. Abs. ±4: 000S 100S 0100S
Max. Abs. ±8: 0000S 1000S
Max. Abs. ±…: …S
Max. Abs. ±…: …S
Max. Abs. ±…: …S
Max. Abs. ±…: …S

In order to test the performance of the VLC method, we use it to compress and write 10 frames of two sequences into DRAM; the reduction of the written data bandwidth is calculated by comparing the compressed data to the original frame data without compression. Table III shows the bandwidth reduction of the VLC method compared with the Exponential-Golomb coding method of different orders. Exponential-Golomb is a universal code that does not take advantage of the redundancy between code words, but in real pictures this redundancy (in our design, each code word represents a difference between samples) is very high. Since the VLC method in the proposed FRC scheme exploits this redundancy to improve compression efficiency, it achieves a better compression effect, as shown in Table III.

TABLE III
COMPRESSION EFFICIENCY COMPARISON FOR THE PROPOSED VLC METHOD AND THE EXPONENTIAL-GOLOMB METHOD

Bandwidth Reduction (%)
Sequence   Proposed   Exp-Golomb (order 0)   Exp-Golomb (order 1)   Exp-Golomb (order 2)
…          -57.6%     -44.2%                 -46.4%                 -44.8%
…          -69.4%     -55.5%                 -55.8%                 -51.7%

F. Summary

The data flow of the proposed frame recompression scheme is summarized in Fig. 7. The reference information for the three levels is recompressed, and then the length tables and compressed data of the two lower levels, as well as the original data of level 2, are generated and stored into DRAM separately. In this scheme, the three levels' data buffers store decompressed data to avoid cross accesses between buffers.
As a result, in the process of parallel multi-resolution ME, the block of reference values needed by the current level is determined, and the corresponding data buffer is checked for the needed partitions of the reference block. If the data buffer in level 2 does not contain the needed data for the level 2 search, the data are directly fetched from DRAM and the cache is updated. If a miss is detected in the level 1 data buffer, the length buffer is checked and updated to get the length information, and then the compressed data are fetched; to decompress the data, the level 2 data buffer has to be checked and updated in order to get the average values. Similarly, for level 0, the level 0 length buffer and the level 1 data buffer are checked and updated if the level 0 data buffer misses the needed data, and updating the level 1 data buffer in turn requires checking the level 1 length buffer as well as the level 2 data buffer, as mentioned above. In this processing flow, there are two types of DRAM latency, for length fetching and for compressed-data fetching respectively. However, since there is no feedback from the fetched information to subsequent DRAM requests, it is not difficult to pipeline the whole flow by putting the DRAM requests in a queue for latency concealment. In order to implement the FRC strategy in the hierarchical ME structure (Fig. 2), several new modules (not included in Fig. 2) have to be implemented and integrated into the ME structure.
They include the length buffers of level 0 and level 1 that store the lengths of the compressed partitions, the recompressor that performs the compression and generates the compressed data for DRAM storage, and the decompressor that decompresses the fetched compressed partitions, according to the data flow in Fig. 7; the three levels' data buffers that store the decompressed reference data are the same as the original three data buffers in Fig. 2.

Fig. 7. Data flow of the proposed scheme. [The reconstructed picture passes through the recompressor; the length tables and compressed data of levels 0 and 1 and the original level 2 data are stored in DRAM; on fetch, the length buffers and the level 0/1 decompressors (using average values from the level above) feed the three level data buffers that supply reference samples to the level 0/1/2 searches.]

IV. EARLY LEVEL TERMINATION STRATEGY

A. Early Level Termination Method

Since the proposed hierarchical ME scheme is designed for the worst situation, in which motion is generally fast for beyond high definition videos, it applies a large prediction range as well as the refining strategy in order to maintain much better coding quality. However, for sequences with relatively low motion, a low-level motion search with a small search range can achieve similar ME performance in most cases. In this situation, applying all three levels' searches leads to extra bandwidth and calculation time. Therefore, an early level termination method is applied to terminate the large-area search in the high levels if the low level

search is predicted to generate an ideal MV. The proposed early level termination method first classifies each block based on both the motion cost and the expected motion search improvement ([19]) to identify the blocks that only need a small-area search. For the blocks that do need a large-area search, the method further applies thresholds for the two high motion search levels, so that the level 1 and level 2 searches can be enabled selectively.

In the ME process, we define COST as the ME cost of a designated MV:

    COST = SAD + λ × R(MV)    (1)

where SAD is the sum of absolute differences of the block matching error, R(MV) is the number of bits to code the MV, and λ is a constant factor with the same value as the lambda factor defined in H.264 for determining the motion cost during motion estimation.

In order to apply the early level termination method, the predicted motion cost is checked first. COST_pred is defined as:

    COST_pred = COST(PMV)    (2)

i.e., the COST of the PMV. For each block, the classification strategy is defined as:

    class(curblock) = 1, if COST_pred < th
                      2, if COST_pred ≥ th and MVdist(MV_col, PMV) ≤ th1
                      3, if COST_pred ≥ th and th1 < MVdist(MV_col, PMV) ≤ th2
                      4, if COST_pred ≥ th and MVdist(MV_col, PMV) > th2    (3)

where curblock is the current block and MV_col is the final MV of the co-located block. th is the threshold that decides whether COST_pred is a reasonable motion cost for the current block, while th1 and th2 are thresholds that decide the significance of the motion distance between the PMV and MV_col.

From Eqn. (3), the first class of blocks appears to have uniform motion with small COST_pred values; such blocks can be motion predicted within a small search range centered on the PMV, similarly to most previous early termination strategies ([20, 21, 22, 23]). If the predicted motion cost is larger than the threshold, the expected motion search improvement is considered. A small motion distance between the predicted final MV and the PMV (corresponding to class 2) means there is a high possibility that even a large-area search would result in an MV near the PMV. The large-area search would therefore not improve the ME performance much, and a small search range is generally enough to find a suitable nearby MV ([19]). In this situation we also apply only the level 0 motion search. The blocks that need a large-area search are further classified into two conditions. If a block falls into class 3, it is reasonable to assume that a moderately large search area is needed to find a more suitable MV away from the PMV. Therefore, the level 1 search is additionally applied for the blocks of class 3. For the blocks of class 4, which have the highest motion, the necessary search range is the largest, and all three level searches are turned on in order to find an accurate MV at a large distance.

According to the experimental results on detection rates in [19], MV_col estimates the real final MV of the current block with high accuracy most of the time. Since it is impossible to obtain the real final MV of the current block at the beginning of the actual ME process, MV_col is used to predict the actual final MV in order to calculate the motion distance error of the PMV. In addition, since there are two high sub-sampling levels in the proposed hierarchical motion estimation strategy, two thresholds (th1 and th2) are applied respectively to further classify the current block and decide the suitable search strategy. The setting of the thresholds is discussed in the next subsection.

The process of the proposed early level termination algorithm is shown in Fig. 8. First, COST_pred is calculated from the PMV of the current block (Eqn. (1) and Eqn. (2)). Then the classification strategy (Eqn. (3)) is applied to the current block.

B. Thresholds Selection

In the proposed method, three thresholds have to be decided: th, th1, and th2. For th, different values are applied according to the motion estimation mode of the current block. The th value of the current block is defined as:

    th(curblock) = (COST_col + COST_nb) / 2, if mode(curblock) = 16x16
                   COST_16x16,               otherwise    (4)

where COST_col is the COST of the co-located block in the previous frame, COST_nb is the COST of the previously encoded neighboring block, and COST_16x16 is the COST of the 16x16 mode of the current MB. In addition, mode is the current motion estimation mode of the block.

Fig. 8. Data flow of the proposed early level termination scheme (compute COST_pred; classify the current block into four classes by Eqn. (3); classes 1 and 2 perform only the level 0 search, class 3 performs the level 0 and level 1 searches, and class 4 performs all three level searches).
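As a compact sketch, the classification of Eqns. (1)-(3) and the level-selection rule of Fig. 8 can be written as follows. The distance measure (the larger MV component difference) and all function names are illustrative assumptions; th1 = 7 and th2 = 16 are the values selected in Section IV-B:

```python
# Sketch of Eqns. (1)-(3) and the level-selection rule of Fig. 8.
# The distance measure below (max of |dx|, |dy|) is an assumption.

def cost(sad, mv_bits, lam):
    # Eqn. (1): COST = SAD + lambda * R(MV)
    return sad + lam * mv_bits

def mv_dist(mv_a, mv_b):
    # Assumed distance measure between two MVs (components in pixels).
    return max(abs(mv_a[0] - mv_b[0]), abs(mv_a[1] - mv_b[1]))

def classify(cost_pred, th, mv_col, pmv, th1=7, th2=16):
    # Eqn. (3): four classes from COST_pred and the MV_col-to-PMV distance.
    if cost_pred < th:
        return 1
    d = mv_dist(mv_col, pmv)
    if d <= th1:
        return 2
    if d <= th2:
        return 3
    return 4

def levels_to_search(block_class):
    # Fig. 8: classes 1 and 2 use level 0 only; class 3 adds level 1;
    # class 4 enables all three levels.
    return {1: [0], 2: [0], 3: [0, 1], 4: [0, 1, 2]}[block_class]
```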

TABLE IV
BIT RATE AND TOTAL BANDWIDTH COMPARISON FOR DIFFERENT SELECTIONS OF th2
(bit rate increase vs. full search)

Sequence | th2=8  | th2=12 | th2=16 | th2=24 | th2=32
         | +3.11% | +3.13% | +3.18% | +3.29% | +3.36%
         | +2.06% | +2.24% | +2.62% | +3.75% | +6.63%
         | +0.43% | +0.60% | +0.83% | +2.33% | +2.97%
         | +0.07% | +0.09% | +0.12% | +3.17% | +5.04%
         | +0.49% | +0.54% | +0.56% | +1.63% | +3.45%
Total bandwidth (M Bytes) | | | | |

According to the simulation results in [20], the motion cost of the best search point for the 16x16 mode remains similar between consecutive frames and provides a good basis for predicting the COST of the current frame MB, while the COSTs of the 16x8, 8x16, and 8x8 modes are highly correlated with the 16x16 mode of the same MB. The threshold value th can therefore be defined from a combination of the motion costs of the most correlated blocks according to the current block mode. Since the search range of level 0 is 7, the threshold for motion distance prediction is set to 7 for th1. For level 1, with the sub-sampled search (SR 32) centered on (0, 0), a series of thresholds for th2 is tested by encoding five beyond-HD sequences, and the bit rate increase compared with full search as well as the total bandwidth is examined, as shown in Table IV. The table shows that increasing th2 further reduces the total bandwidth but at a higher quality loss. When th2 is small, such as 8, increasing it avoids more redundant level 2 searches, so the quality loss is small and negligible. However, when th2 grows large, such as 32 in Table IV, more and more necessary level 2 searches are terminated, which leads to a drastic increase of bit rate and quality loss. Although different values can be chosen according to the requirements, we chose 16 for th2 in our experiment in order to achieve a considerable bandwidth reduction while ensuring the ME performance.

C. Early Refinement Termination

In the proposed method, early refinement termination is further applied to reduce the bandwidth.
By applying the refining strategy to the sub-sampled levels in the proposed hierarchical ME scheme, the MV from the coarse search is further refined to integer-pixel precision and the ME quality is much better. However, the refining process reuses the level 0 module and has to read the original, non-sub-sampled values from DRAM whenever the level 0 data buffer misses the needed data. If the refinement does not improve the final result of the coarse search, this non-sub-sampled data cannot be reused in the fractional ME part, since the low-level search MV is chosen as the final result of integer ME; this causes extra bandwidth. In order to avoid unnecessary refining searches and reduce this part of the DRAM traffic, an early refinement termination strategy is proposed. In the high-level search process, the coarse search result is transmitted to the local refining search. After the coarse search based on the sub-sampled values, the motion cost of the coarse MV is compared with that of the level 0 fine search result, and the refinement is terminated if the COST of the coarse MV is larger than that of the MV resulting from the low-level search, as the level 0 search is then considered sufficient to predict the motion without causing too much quality loss.

Table V shows the hitting rates of the early refinement termination algorithm based on the encoding of two sequences.

TABLE V
HITTING RATES OF THE EARLY REFINEMENT TERMINATION ALGORITHM

Sequence | Level 1 | Level 2
         |       % | 93.56%
         | 90.48%  | 91.67%

In the experiment, the total number of times the early refinement termination condition is satisfied is counted separately for the level 1 and level 2 searches. For each high level, the hitting times are also counted, i.e., the cases in which the termination condition is satisfied and the motion cost of the final MV of the high level, even with the refining strategy, is still larger than that of the MV resulting from the low-level search.
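The termination rule described above can be sketched as follows; the function names and the (mv, cost) result tuples are illustrative, not the paper's interfaces:

```python
# Sketch of early refinement termination: after the coarse (sub-sampled)
# search at a high level, refinement is skipped whenever the coarse MV's
# motion cost already exceeds that of the level 0 fine search result.

def should_refine(coarse_mv_cost, level0_mv_cost):
    """Refine only if the coarse search still beats the level 0 result;
    otherwise the level 0 MV is kept and no extra DRAM traffic is spent."""
    return coarse_mv_cost <= level0_mv_cost

def high_level_search(coarse_search, refine, level0_result):
    # coarse_search() -> (mv, cost) on sub-sampled data
    # refine(mv) -> (mv, cost) on full-resolution data (reuses level 0 HW)
    mv, c = coarse_search()
    if not should_refine(c, level0_result[1]):
        return level0_result          # terminate: reuse the level 0 MV
    return refine(mv)
```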
The hitting rates are calculated as the ratio of hitting times to total times for each high level. Table V shows that the proposed method has high hitting rates and successfully avoids most unnecessary refining searches.

V. EXPERIMENTAL RESULTS

A. Simulation Condition

In order to evaluate the performance of the total proposed scheme, the simulation is based on JM 15.1 for typical sequences. Since we do not have SHV-sized sequences, the two QFHD-sized sequences (crowdrun and parkjoy) are expanded by bilinear interpolation to generate the 4320p sequences. It is hard to run the full 4320p sequences in the experiment due to the huge computer memory requirement, so we only encode the middle 4320x2160 areas of the two 4320p sequences. The encoded parts have the same motion distances as the original sequences, and the basic behavior of the proposed hierarchical ME algorithm is mainly influenced by the motion rather than by the picture definition. Therefore, this experimental strategy is expected to be able to

simulate the performance of the proposed scheme. In addition to the two 4320p sequences, three 2160p sequences as well as several 720p and 1080p sequences are also simulated. According to [1], SHV sequences have relatively higher motion than lower-definition sequences with similar content. Therefore, several frames of these test sequences are skipped in order to simulate the motion distance of SHV sequences. Since the MV precision of SHV sequences is doubled compared with the QFHD-sized sequences, each 2160p sequence skips 2 frames in order to achieve a similar motion vector distance. Similarly, according to the MV scaling relative to the SHV sequences, the number of skipped frames is estimated to be about 4 for 1080p and 8 for 720p sequences. We therefore skip 2 frames for 2160p sequences, 4 frames for 1080p sequences, and 8 frames for 720p sequences. All of the sequences use 4:2:0 color sampling at 8 bpp (bits per pixel), and the simulation is modeled on this condition. Since the basic method of the proposed hierarchical ME and FRC algorithms is not influenced by the frame format, it is expected to extend to other conditions. In the experiment, all sequences are encoded for 15 frames with IPPP frame structure and RDO off.

B. Cache Organization

In order to realize the total proposed scheme, two kinds of caches are used as pre-fetch buffers, storing the reference data (level 0-2 data buffers in Fig. 7) and the partition lengths (length buffers 0-1 in Fig. 7).

TABLE VI
BUFFER SIZE FOR THREE LEVELS
(Data buffer and length buffer sizes in Kbyte for levels 0, 1 and 2, their totals, the total of a direct design, and the saving in %.)

For the data buffers that store the decompressed reference data of the three levels, the cache organization shown in Fig. 9 is
applied. It is set to have 6x6 groups for level 0 and level 1, and 18x18 groups for level 2; according to our experiment, this organization achieves the best bandwidth result.

Fig. 9. Cache organization for data buffers. The cache size is 6x6 groups for level 0 and level 1, and 18x18 groups for level 2.

As a result, the remainder of the group address divided by the cache size, together with the partition address inside the group, is used as the index to locate each partition, while the quotient serves as the tag that determines whether the location stores the wanted partition. Two fully associative FIFO (first in, first out) caches are used to store the lengths of the compressed partitions of levels 0 and 1. For each level, since the lengths of the other partitions in the same group are always needed to calculate the start address, the partition lengths of one group are processed as a unit. To increase the efficiency of each DRAM access, the length information of every 4 horizontal groups is fetched to update the designated length cache when a miss is detected. Table VI shows the sizes of the three levels' data buffers as well as the length buffers, indicating only a 1% cache size increase for length buffering. The proposed ME scheme saves considerable on-chip memory compared with a direct design.

C. Performance of Frame Recompression Scheme

First, the lossless frame recompression scheme is integrated into the proposed hierarchical ME algorithm and its performance is evaluated. To date there is no FRC scheme based on a hierarchical ME structure, hence we compare the performance of the total ME scheme with the designated FRC strategy integrated. Since the FRC scheme proposed in [12] is designed to be integrated into an FS ME scheme, the total performance of that FRC-integrated motion search can be evaluated for comparison.
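The index/tag computation for the data buffers described in the cache organization above can be sketched as follows; the 2-D group address and the class shape are illustrative, with CACHE_GROUPS = 6 matching the 6x6-group organization of levels 0 and 1:

```python
# Sketch of the data-buffer addressing: the group address modulo the
# cache size (plus the partition address inside the group) indexes a
# cache slot, and the quotient serves as the tag that validates a hit.

CACHE_GROUPS = 6   # 6x6 groups for levels 0 and 1 (18 for level 2)

def cache_slot(group_x, group_y, part):
    """Return (index, tag) for a partition inside a group."""
    idx = (group_x % CACHE_GROUPS, group_y % CACHE_GROUPS, part)
    tag = (group_x // CACHE_GROUPS, group_y // CACHE_GROUPS)
    return idx, tag

class DataBuffer:
    def __init__(self):
        self.tags = {}     # index -> tag currently stored
        self.data = {}     # index -> partition samples

    def lookup(self, group_x, group_y, part):
        idx, tag = cache_slot(group_x, group_y, part)
        if self.tags.get(idx) == tag:
            return self.data[idx]          # hit
        return None                        # miss: caller fetches from DRAM

    def fill(self, group_x, group_y, part, samples):
        idx, tag = cache_slot(group_x, group_y, part)
        self.tags[idx] = tag
        self.data[idx] = samples
```

Two groups whose addresses differ by a multiple of the cache size map to the same slot, so the tag (the quotient) is what distinguishes them on lookup.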
As shown in Table VII, although the FRC scheme in [12] reduces a considerable amount of bandwidth for an FS video encoder, the total bandwidth of that ME scheme is still too large for a beyond-HD encoder chip design.

TABLE VII
PSNR, BIT RATE AND BANDWIDTH COMPARISON OF FRC-INTEGRATED ME SCHEMES
(PSNR increase, bit rate increase and total bandwidth in MBytes for "FS ME + FRC in [12]" and "Proposed Hier. ME + FRC"; sequences: Night (720p), Shields (720p), Crosswalk, Parkscene, Woman.)

Therefore, we propose the

hierarchical ME algorithm combined with the hierarchical data recompression strategy in order to reduce the DRAM bandwidth further. According to the results in Table VII, the proposed FRC-integrated hierarchical ME algorithm achieves much lower bandwidth than the former scheme. The quality loss of the proposed algorithm compared with the FS-based scheme comes from the hierarchical ME algorithm, which trades it for hardware efficiency; since the proposed FRC scheme is based on lossless compression, frame recompression itself introduces no quality loss.

D. Performance of Early Level Termination

The performance of the early level termination strategy is tested by integrating it into the proposed hierarchical ME scheme with lossless frame recompression. In the comparison, the PSNR and bit rate increases compared with FS, as well as the total bandwidth, are examined, as shown in Table VIII. In addition, the bandwidth reduction rate (ΔBW) is calculated by comparing the final total scheme, with the early level termination method, against the original proposed scheme without the terminating strategy.

TABLE VIII
PSNR, BIT RATE AND BANDWIDTH COMPARISON BY ADDING THE EARLY LEVEL TERMINATION STRATEGY
(PSNR increase, bit rate increase and total bandwidth in MBytes for "Proposed Hier. ME + FRC" and "Proposed Hier. ME + FRC + ELT", plus ΔBW, for sequences including Night (720p), Shields (720p), Crosswalk, Parkscene and Woman. ΔBW values: -20.6%, -21.4%, -21.7%, -21.9%, -22.8%, -18.8%, -18.3%, -21.7%, -21.4%, -19.6%.)

According to the results of Table VIII, the proposed early level termination strategy achieves a further 20% bandwidth reduction at the cost of little quality loss when integrated into the proposed ME scheme. Although the quality loss becomes larger when dramatically high motion occurs, as in the shields and woman sequences, this loss is still acceptable, and the total scheme also achieves the best ME performance in comparison with the other previous works in the following discussion.

E. Performance Comparison of Total Scheme

The performance of the total scheme, which includes the hierarchical ME structure, the frame recompression scheme, and the early level termination strategy, is tested by comparing it with CBME and PMRME. The total proposed scheme and the two former works are implemented, and the coding results of PSNR, bit rate, and bandwidth are compared with FS, as shown in Table IX. The experiment is modeled with a QP (quantization parameter) value of 24.

TABLE IX
PSNR, BIT RATE AND TOTAL BANDWIDTH COMPARED WITH FULL SEARCH (QP: 24; RDO: OFF; FRAME STRUCTURE: IPPP)
(Bit rate increase (%), PSNR increase and total bandwidth in MBytes for the direct design, CBME, PMRME, and the proposed total scheme, for sequences including Night (720p), Shields (720p), Crosswalk, Parkscene and Woman.)

According to the experimental results, the following two observations can be made by comparing the ME performance and the total bandwidth separately:

(1) The performance of CBME is not as good as that of the proposed scheme due to its limited

prediction range. The results show that the bit rate increase of CBME becomes rather large when high motion occurs, which greatly reduces the coding efficiency of the total encoder system. Since the proposed scheme aims to ensure the ME quality for the high motions that exist in beyond-HD sequences, it adopts the hierarchical ME strategy with integrated large-area searches. As a result, it achieves a much smaller bit rate increase than CBME for high motion sequences. In addition, the proposed algorithm applies the refining strategy to the high levels during the hierarchical ME process; hence it also outperforms PMRME in ME quality. Fig. 10(a) compares the bit rate increase of CBME, PMRME, and the proposed scheme for different sequences, according to the results in Table IX.

(2) For the bandwidth, the proposed total scheme achieves more than 80% bandwidth reduction and outperforms PMRME according to the experimental results. Its bandwidth is larger than that of CBME for some sequences because the proposed scheme applies a much larger search range in order to ensure the ME performance for high motions, sacrificing some bandwidth consumption. Fig. 10(b) shows the total bandwidth of the three methods relative to that of the direct design for different sequences.

Fig. 10. Comparison of (a) bit rate increase and (b) total bandwidth between CBME, PMRME and the proposed scheme.

Besides, since the proposed ME architecture always stores the original reference frames at the low level, it is very suitable for applying the bit truncation method to the average values of the high levels without causing any drift error between reference frames.
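As a minimal illustration of the bit truncation idea mentioned above; the truncation width of 2 bits is an arbitrary example, not a value from the paper:

```python
# Sketch of bit truncation for the high-level average values: because
# the full-precision originals are kept at level 0, truncated averages
# at the high levels introduce no drift between reference frames.

def truncate_avg(avg, bits=2):
    """Drop the `bits` least significant bits of an 8-bit average sample."""
    return (avg >> bits) << bits
```

Truncating 2 of 8 bits would cut the high-level storage and transfer for the averages by a quarter, at the price of a coarser coarse-search cost.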
Therefore, the bandwidth will be further reduced in our future work.

F. Performance of the Total Scheme with Different QP Values

The total proposed scheme is simulated with different QP values for all of the 4320p and 2160p sequences. Table X shows the PSNR and bit rate increases, as well as the total bandwidth reduction, obtained by comparing the proposed scheme with the direct design for QP values of 24, 28, 32, and 36. When the QP value rises, the details of the reference frames are lost, and hence the frames become easier to compress.

TABLE X
PSNR, BIT RATE AND TOTAL BANDWIDTH COMPARED WITH FULL SEARCH (QP: 24, 28, 32, 36)
(PSNR increase, bit rate increase (%) and bandwidth reduction (%) for each QP; rows: the 4320p and 2160p sequences.)

TABLE XI
BDPSNR AND BD BIT RATE COMPARISON
(BDPSNR and BD bit rate (%) for CBME, PMRME and the proposed scheme.)

The

experimental results show that the bandwidth reduction of the proposed scheme grows as the QP value becomes high. On the other hand, the ME quality degrades due to the increased blocking artifacts at higher QP, as also described in [10]. With the PSNR and bit rate results of the 4 QP values, we calculate the BDPSNR and BD bit rate as the average differences between the FS scheme and the proposed one, according to the method in [25]. We also calculate these values for CBME and PMRME by running the simulations at the 4 QP values for three sequences. Finally, the results of the three schemes are compared, as shown in Table XI. From the table, the proposed scheme also achieves better ME performance than the two former works when the RD costs are considered together.

G. Subjective Comparison

In the subjective comparison, the total scheme as well as the two former works is encoded, and typical reconstructed frames of three sequences are compared. According to Fig. 11, there are no distinct differences between the three works. From the experimental results shown in Table IX, the PSNR values of the three schemes are also very close. Therefore, the subjective and objective differences between the three schemes are not distinctive, and the main difference lies in the final coding bit rates. As discussed in the performance comparison, the total proposed scheme is able to restrain the increase of the total coding bit rate, especially for high motion sequences, and achieves better ME performance than the two former works.

TABLE XII
HARDWARE CLOCK CYCLES COMPARISON
(Clock cycles for each MB for CBME, PMRME and the proposed scheme.)

H. Total Running Time Comparison

Since the proposed algorithm is hardware-oriented, the JM software is only used in our work to model the coding performance and bandwidth consumption; it is not optimized for execution time reduction.
So we did not directly compare software running times. Instead, we calculate the expected hardware clock cycles of the proposed ME scheme and compare them with the two previous works. The bottleneck of the running clock cycles for the proposed scheme is the ME calculation for each MB, since the whole flow

Fig. 11. Comparison of reconstructed frames between CBME, PMRME and the proposed scheme (QP: 24). (a) Sequence: Woman, 16th frame. (b) Sequence: , 11th frame. (c) Sequence: , 3rd frame.
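For reference, the BDPSNR computation of [25] used in Section V-F can be sketched in pure Python. With exactly four QP points the third-order fit reduces to interpolation; all function names here are ours:

```python
import math

def _cubic_coeffs(xs, ys):
    """Solve the 4x4 Vandermonde system for an exact cubic through 4 points."""
    n = 4
    a = [[xs[i] ** j for j in range(n)] + [ys[i]] for i in range(n)]
    for col in range(n):                      # Gaussian elimination, partial pivoting
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):            # back substitution
        s = a[r][n] - sum(a[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = s / a[r][r]
    return coef                               # c0 + c1*x + c2*x^2 + c3*x^3

def _integral(coef, lo, hi):
    return sum(c / (i + 1) * (hi ** (i + 1) - lo ** (i + 1))
               for i, c in enumerate(coef))

def bd_psnr(rates1, psnrs1, rates2, psnrs2):
    """Average PSNR difference (curve 2 minus curve 1) over the overlapping
    log-rate range, following the Bjontegaard-delta method of [25]."""
    x1 = [math.log10(r) for r in rates1]
    x2 = [math.log10(r) for r in rates2]
    c1 = _cubic_coeffs(x1, psnrs1)
    c2 = _cubic_coeffs(x2, psnrs2)
    lo, hi = max(min(x1), min(x2)), min(max(x1), max(x2))
    return (_integral(c2, lo, hi) - _integral(c1, lo, hi)) / (hi - lo)
```

A curve that is uniformly 1 dB above another at the same rates yields a BDPSNR of 1.0 dB; the BD bit rate is obtained analogously by fitting log-rate as a function of PSNR.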


Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Course Presentation Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Video Coding Correlation in Video Sequence Spatial correlation Similar pixels seem

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS Theepan Moorthy and Andy Ye Department of Electrical and Computer Engineering Ryerson

More information

Motion Estimation for Video Coding Standards

Motion Estimation for Video Coding Standards Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

2014 Summer School on MPEG/VCEG Video. Video Coding Concept

2014 Summer School on MPEG/VCEG Video. Video Coding Concept 2014 Summer School on MPEG/VCEG Video 1 Video Coding Concept Outline 2 Introduction Capture and representation of digital video Fundamentals of video coding Summary Outline 3 Introduction Capture and representation

More information

implementation using GPU architecture is implemented only from the viewpoint of frame level parallel encoding [6]. However, it is obvious that the mot

implementation using GPU architecture is implemented only from the viewpoint of frame level parallel encoding [6]. However, it is obvious that the mot Parallel Implementation Algorithm of Motion Estimation for GPU Applications by Tian Song 1,2*, Masashi Koshino 2, Yuya Matsunohana 2 and Takashi Shimamoto 1,2 Abstract The video coding standard H.264/AVC

More information

Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter

Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter Y. Vatis, B. Edler, I. Wassermann, D. T. Nguyen and J. Ostermann ABSTRACT Standard video compression techniques

More information

Sample Adaptive Offset Optimization in HEVC

Sample Adaptive Offset Optimization in HEVC Sensors & Transducers 2014 by IFSA Publishing, S. L. http://www.sensorsportal.com Sample Adaptive Offset Optimization in HEVC * Yang Zhang, Zhi Liu, Jianfeng Qu North China University of Technology, Jinyuanzhuang

More information

Digital Video Processing

Digital Video Processing Video signal is basically any sequence of time varying images. In a digital video, the picture information is digitized both spatially and temporally and the resultant pixel intensities are quantized.

More information

CONTENT ADAPTIVE COMPLEXITY REDUCTION SCHEME FOR QUALITY/FIDELITY SCALABLE HEVC

CONTENT ADAPTIVE COMPLEXITY REDUCTION SCHEME FOR QUALITY/FIDELITY SCALABLE HEVC CONTENT ADAPTIVE COMPLEXITY REDUCTION SCHEME FOR QUALITY/FIDELITY SCALABLE HEVC Hamid Reza Tohidypour, Mahsa T. Pourazad 1,2, and Panos Nasiopoulos 1 1 Department of Electrical & Computer Engineering,

More information

Lecture 5: Error Resilience & Scalability

Lecture 5: Error Resilience & Scalability Lecture 5: Error Resilience & Scalability Dr Reji Mathew A/Prof. Jian Zhang NICTA & CSE UNSW COMP9519 Multimedia Systems S 010 jzhang@cse.unsw.edu.au Outline Error Resilience Scalability Including slides

More information

Complexity Reduced Mode Selection of H.264/AVC Intra Coding

Complexity Reduced Mode Selection of H.264/AVC Intra Coding Complexity Reduced Mode Selection of H.264/AVC Intra Coding Mohammed Golam Sarwer 1,2, Lai-Man Po 1, Jonathan Wu 2 1 Department of Electronic Engineering City University of Hong Kong Kowloon, Hong Kong

More information

Texture Compression. Jacob Ström, Ericsson Research

Texture Compression. Jacob Ström, Ericsson Research Texture Compression Jacob Ström, Ericsson Research Overview Benefits of texture compression Differences from ordinary image compression Texture compression algorithms BTC The mother of all texture compression

More information

Research on Transcoding of MPEG-2/H.264 Video Compression

Research on Transcoding of MPEG-2/H.264 Video Compression Research on Transcoding of MPEG-2/H.264 Video Compression WEI, Xianghui Graduate School of Information, Production and Systems Waseda University February 2009 Abstract Video transcoding performs one or

More information

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc.

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc. Upcoming Video Standards Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc. Outline Brief history of Video Coding standards Scalable Video Coding (SVC) standard Multiview Video Coding

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Fingerprint Image Compression

Fingerprint Image Compression Fingerprint Image Compression Ms.Mansi Kambli 1*,Ms.Shalini Bhatia 2 * Student 1*, Professor 2 * Thadomal Shahani Engineering College * 1,2 Abstract Modified Set Partitioning in Hierarchical Tree with

More information

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson zhuyongxin@sjtu.edu.cn Basic Video Compression Techniques Chapter 10 10.1 Introduction to Video Compression

More information

A Quantized Transform-Domain Motion Estimation Technique for H.264 Secondary SP-frames

A Quantized Transform-Domain Motion Estimation Technique for H.264 Secondary SP-frames A Quantized Transform-Domain Motion Estimation Technique for H.264 Secondary SP-frames Ki-Kit Lai, Yui-Lam Chan, and Wan-Chi Siu Centre for Signal Processing Department of Electronic and Information Engineering

More information

Video Compression An Introduction

Video Compression An Introduction Video Compression An Introduction The increasing demand to incorporate video data into telecommunications services, the corporate environment, the entertainment industry, and even at home has made digital

More information

Motion Vector Coding Algorithm Based on Adaptive Template Matching

Motion Vector Coding Algorithm Based on Adaptive Template Matching Motion Vector Coding Algorithm Based on Adaptive Template Matching Wen Yang #1, Oscar C. Au #2, Jingjing Dai #3, Feng Zou #4, Chao Pang #5,Yu Liu 6 # Electronic and Computer Engineering, The Hong Kong

More information

CMPT 365 Multimedia Systems. Media Compression - Video

CMPT 365 Multimedia Systems. Media Compression - Video CMPT 365 Multimedia Systems Media Compression - Video Spring 2017 Edited from slides by Dr. Jiangchuan Liu CMPT365 Multimedia Systems 1 Introduction What s video? a time-ordered sequence of frames, i.e.,

More information

MPEG-4: Simple Profile (SP)

MPEG-4: Simple Profile (SP) MPEG-4: Simple Profile (SP) I-VOP (Intra-coded rectangular VOP, progressive video format) P-VOP (Inter-coded rectangular VOP, progressive video format) Short Header mode (compatibility with H.263 codec)

More information

Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block Transform

Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block Transform Circuits and Systems, 2010, 1, 12-17 doi:10.4236/cs.2010.11003 Published Online July 2010 (http://www.scirp.org/journal/cs) Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block

More information

Professor Laurence S. Dooley. School of Computing and Communications Milton Keynes, UK

Professor Laurence S. Dooley. School of Computing and Communications Milton Keynes, UK Professor Laurence S. Dooley School of Computing and Communications Milton Keynes, UK How many bits required? 2.4Mbytes 84Kbytes 9.8Kbytes 50Kbytes Data Information Data and information are NOT the same!

More information

In the name of Allah. the compassionate, the merciful

In the name of Allah. the compassionate, the merciful In the name of Allah the compassionate, the merciful Digital Video Systems S. Kasaei Room: CE 315 Department of Computer Engineering Sharif University of Technology E-Mail: skasaei@sharif.edu Webpage:

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

An Efficient Mode Selection Algorithm for H.264

An Efficient Mode Selection Algorithm for H.264 An Efficient Mode Selection Algorithm for H.64 Lu Lu 1, Wenhan Wu, and Zhou Wei 3 1 South China University of Technology, Institute of Computer Science, Guangzhou 510640, China lul@scut.edu.cn South China

More information

Error Concealment Used for P-Frame on Video Stream over the Internet

Error Concealment Used for P-Frame on Video Stream over the Internet Error Concealment Used for P-Frame on Video Stream over the Internet MA RAN, ZHANG ZHAO-YANG, AN PING Key Laboratory of Advanced Displays and System Application, Ministry of Education School of Communication

More information

CODING METHOD FOR EMBEDDING AUDIO IN VIDEO STREAM. Harri Sorokin, Jari Koivusaari, Moncef Gabbouj, and Jarmo Takala

CODING METHOD FOR EMBEDDING AUDIO IN VIDEO STREAM. Harri Sorokin, Jari Koivusaari, Moncef Gabbouj, and Jarmo Takala CODING METHOD FOR EMBEDDING AUDIO IN VIDEO STREAM Harri Sorokin, Jari Koivusaari, Moncef Gabbouj, and Jarmo Takala Tampere University of Technology Korkeakoulunkatu 1, 720 Tampere, Finland ABSTRACT In

More information

Semi-Hierarchical Based Motion Estimation Algorithm for the Dirac Video Encoder

Semi-Hierarchical Based Motion Estimation Algorithm for the Dirac Video Encoder Semi-Hierarchical Based Motion Estimation Algorithm for the Dirac Video Encoder M. TUN, K. K. LOO, J. COSMAS School of Engineering and Design Brunel University Kingston Lane, Uxbridge, UB8 3PH UNITED KINGDOM

More information

Professor, CSE Department, Nirma University, Ahmedabad, India

Professor, CSE Department, Nirma University, Ahmedabad, India Bandwidth Optimization for Real Time Video Streaming Sarthak Trivedi 1, Priyanka Sharma 2 1 M.Tech Scholar, CSE Department, Nirma University, Ahmedabad, India 2 Professor, CSE Department, Nirma University,

More information

Rate Distortion Optimization in Video Compression

Rate Distortion Optimization in Video Compression Rate Distortion Optimization in Video Compression Xue Tu Dept. of Electrical and Computer Engineering State University of New York at Stony Brook 1. Introduction From Shannon s classic rate distortion

More information

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB Memory Technology Caches 1 Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per GB Ideal memory Average access time similar

More information

EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER

EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER Zong-Yi Chen, Jiunn-Tsair Fang 2, Tsai-Ling Liao, and Pao-Chi Chang Department of Communication Engineering, National Central

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager Optimizing for DirectX Graphics Richard Huddy European Developer Relations Manager Also on today from ATI... Start & End Time: 12:00pm 1:00pm Title: Precomputed Radiance Transfer and Spherical Harmonic

More information

SINGLE PASS DEPENDENT BIT ALLOCATION FOR SPATIAL SCALABILITY CODING OF H.264/SVC

SINGLE PASS DEPENDENT BIT ALLOCATION FOR SPATIAL SCALABILITY CODING OF H.264/SVC SINGLE PASS DEPENDENT BIT ALLOCATION FOR SPATIAL SCALABILITY CODING OF H.264/SVC Randa Atta, Rehab F. Abdel-Kader, and Amera Abd-AlRahem Electrical Engineering Department, Faculty of Engineering, Port

More information

H.264 STANDARD BASED SIDE INFORMATION GENERATION IN WYNER-ZIV CODING

H.264 STANDARD BASED SIDE INFORMATION GENERATION IN WYNER-ZIV CODING H.264 STANDARD BASED SIDE INFORMATION GENERATION IN WYNER-ZIV CODING SUBRAHMANYA MAIRA VENKATRAV Supervising Professor: Dr. K. R. Rao 1 TABLE OF CONTENTS 1. Introduction 1.1. Wyner-Ziv video coding 1.2.

More information

View Synthesis for Multiview Video Compression

View Synthesis for Multiview Video Compression View Synthesis for Multiview Video Compression Emin Martinian, Alexander Behrens, Jun Xin, and Anthony Vetro email:{martinian,jxin,avetro}@merl.com, behrens@tnt.uni-hannover.de Mitsubishi Electric Research

More information

VIDEO COMPRESSION STANDARDS

VIDEO COMPRESSION STANDARDS VIDEO COMPRESSION STANDARDS Family of standards: the evolution of the coding model state of the art (and implementation technology support): H.261: videoconference x64 (1988) MPEG-1: CD storage (up to

More information

SR college of engineering, Warangal, Andhra Pradesh, India 1

SR college of engineering, Warangal, Andhra Pradesh, India   1 POWER OPTIMIZATION IN SYSTEM ON CHIP BY IMPLEMENTATION OF EFFICIENT CACHE ARCHITECTURE 1 AKKALA SUBBA RAO, 2 PRATIK GANGULY 1 Associate Professor, 2 Senior Research Fellow, Dept. of. Electronics and Communications

More information

FAST: A Framework to Accelerate Super- Resolution Processing on Compressed Videos

FAST: A Framework to Accelerate Super- Resolution Processing on Compressed Videos FAST: A Framework to Accelerate Super- Resolution Processing on Compressed Videos Zhengdong Zhang, Vivienne Sze Massachusetts Institute of Technology http://www.mit.edu/~sze/fast.html 1 Super-Resolution

More information

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager Optimizing DirectX Graphics Richard Huddy European Developer Relations Manager Some early observations Bear in mind that graphics performance problems are both commoner and rarer than you d think The most

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Low Power Set-Associative Cache with Single-Cycle Partial Tag Comparison

Low Power Set-Associative Cache with Single-Cycle Partial Tag Comparison Low Power Set-Associative Cache with Single-Cycle Partial Tag Comparison Jian Chen, Ruihua Peng, Yuzhuo Fu School of Micro-electronics, Shanghai Jiao Tong University, Shanghai 200030, China {chenjian,

More information

A Low Power 720p Motion Estimation Processor with 3D Stacked Memory

A Low Power 720p Motion Estimation Processor with 3D Stacked Memory A Low Power 720p Motion Estimation Processor with 3D Stacked Memory Shuping Zhang, Jinjia Zhou, Dajiang Zhou and Satoshi Goto Graduate School of Information, Production and Systems, Waseda University 2-7

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

H.264/AVC BASED NEAR LOSSLESS INTRA CODEC USING LINE-BASED PREDICTION AND MODIFIED CABAC. Jung-Ah Choi, Jin Heo, and Yo-Sung Ho

H.264/AVC BASED NEAR LOSSLESS INTRA CODEC USING LINE-BASED PREDICTION AND MODIFIED CABAC. Jung-Ah Choi, Jin Heo, and Yo-Sung Ho H.264/AVC BASED NEAR LOSSLESS INTRA CODEC USING LINE-BASED PREDICTION AND MODIFIED CABAC Jung-Ah Choi, Jin Heo, and Yo-Sung Ho Gwangju Institute of Science and Technology {jachoi, jinheo, hoyo}@gist.ac.kr

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

A COMPARISON OF CABAC THROUGHPUT FOR HEVC/H.265 VS. AVC/H.264. Massachusetts Institute of Technology Texas Instruments

A COMPARISON OF CABAC THROUGHPUT FOR HEVC/H.265 VS. AVC/H.264. Massachusetts Institute of Technology Texas Instruments 2013 IEEE Workshop on Signal Processing Systems A COMPARISON OF CABAC THROUGHPUT FOR HEVC/H.265 VS. AVC/H.264 Vivienne Sze, Madhukar Budagavi Massachusetts Institute of Technology Texas Instruments ABSTRACT

More information

FPGA Provides Speedy Data Compression for Hyperspectral Imagery

FPGA Provides Speedy Data Compression for Hyperspectral Imagery FPGA Provides Speedy Data Compression for Hyperspectral Imagery Engineers implement the Fast Lossless compression algorithm on a Virtex-5 FPGA; this implementation provides the ability to keep up with

More information

Reducing The De-linearization of Data Placement to Improve Deduplication Performance

Reducing The De-linearization of Data Placement to Improve Deduplication Performance Reducing The De-linearization of Data Placement to Improve Deduplication Performance Yujuan Tan 1, Zhichao Yan 2, Dan Feng 2, E. H.-M. Sha 1,3 1 School of Computer Science & Technology, Chongqing University

More information

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014 Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms Visual Computing Systems Review: mechanisms to reduce aliasing in the graphics pipeline When sampling visibility?! -

More information

Video encoders have always been one of the resource

Video encoders have always been one of the resource Fast Coding Unit Partition Search Satish Lokkoju # \ Dinesh Reddl2 # Samsung India Software Operations Private Ltd Bangalore, India. l l.satish@samsung.com 2 0inesh.reddy@samsung.com Abstract- Quad tree

More information

Ch. 2: Compression Basics Multimedia Systems

Ch. 2: Compression Basics Multimedia Systems Ch. 2: Compression Basics Multimedia Systems Prof. Ben Lee School of Electrical Engineering and Computer Science Oregon State University Outline Why compression? Classification Entropy and Information

More information

Real-Time Buffer Compression. Michael Doggett Department of Computer Science Lund university

Real-Time Buffer Compression. Michael Doggett Department of Computer Science Lund university Real-Time Buffer Compression Michael Doggett Department of Computer Science Lund university Project 3D graphics project Demo, Game Implement 3D graphics algorithm(s) C++/OpenGL(Lab2)/iOS/android/3D engine

More information

Tutorial T5. Video Over IP. Magda El-Zarki (University of California at Irvine) Monday, 23 April, Morning

Tutorial T5. Video Over IP. Magda El-Zarki (University of California at Irvine) Monday, 23 April, Morning Tutorial T5 Video Over IP Magda El-Zarki (University of California at Irvine) Monday, 23 April, 2001 - Morning Infocom 2001 VIP - Magda El Zarki I.1 MPEG-4 over IP - Part 1 Magda El Zarki Dept. of ICS

More information

EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM

EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM 1 KALIKI SRI HARSHA REDDY, 2 R.SARAVANAN 1 M.Tech VLSI Design, SASTRA University, Thanjavur, Tamilnadu,

More information

Spline-Based Motion Vector Encoding Scheme

Spline-Based Motion Vector Encoding Scheme Spline-Based Motion Vector Encoding Scheme by Parnia Farokhian A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of

More information

A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm

A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm International Journal of Engineering Research and General Science Volume 3, Issue 4, July-August, 15 ISSN 91-2730 A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm

More information

Scalable Extension of HEVC 한종기

Scalable Extension of HEVC 한종기 Scalable Extension of HEVC 한종기 Contents 0. Overview for Scalable Extension of HEVC 1. Requirements and Test Points 2. Coding Gain/Efficiency 3. Complexity 4. System Level Considerations 5. Related Contributions

More information

5LSE0 - Mod 10 Part 1. MPEG Motion Compensation and Video Coding. MPEG Video / Temporal Prediction (1)

5LSE0 - Mod 10 Part 1. MPEG Motion Compensation and Video Coding. MPEG Video / Temporal Prediction (1) 1 Multimedia Video Coding & Architectures (5LSE), Module 1 MPEG-1/ Standards: Motioncompensated video coding 5LSE - Mod 1 Part 1 MPEG Motion Compensation and Video Coding Peter H.N. de With (p.h.n.de.with@tue.nl

More information

Image Compression for Mobile Devices using Prediction and Direct Coding Approach

Image Compression for Mobile Devices using Prediction and Direct Coding Approach Image Compression for Mobile Devices using Prediction and Direct Coding Approach Joshua Rajah Devadason M.E. scholar, CIT Coimbatore, India Mr. T. Ramraj Assistant Professor, CIT Coimbatore, India Abstract

More information

Video Coding Using Spatially Varying Transform

Video Coding Using Spatially Varying Transform Video Coding Using Spatially Varying Transform Cixun Zhang 1, Kemal Ugur 2, Jani Lainema 2, and Moncef Gabbouj 1 1 Tampere University of Technology, Tampere, Finland {cixun.zhang,moncef.gabbouj}@tut.fi

More information

Anatomy of a Video Codec

Anatomy of a Video Codec Anatomy of a Video Codec The inner workings of Ogg Theora Dr. Timothy B. Terriberry Outline Introduction Video Structure Motion Compensation The DCT Transform Quantization and Coding The Loop Filter Conclusion

More information

Hardware-driven visibility culling

Hardware-driven visibility culling Hardware-driven visibility culling I. Introduction 20073114 김정현 The goal of the 3D graphics is to generate a realistic and accurate 3D image. To achieve this, it needs to process not only large amount

More information

Video Compression MPEG-4. Market s requirements for Video compression standard

Video Compression MPEG-4. Market s requirements for Video compression standard Video Compression MPEG-4 Catania 10/04/2008 Arcangelo Bruna Market s requirements for Video compression standard Application s dependent Set Top Boxes (High bit rate) Digital Still Cameras (High / mid

More information