Fast Coding Unit Partition Search

Satish Lokkoju #1, Dinesh Reddy #2
# Samsung India Software Operations Private Ltd, Bangalore, India
1 l.satish@samsung.com, 2 dinesh.reddy@samsung.com

Abstract- Quad tree based encoders perform a brute force search to find the best partition for a Coding Unit (CU). This brute force search performs encoding for all the possible block sizes and selects the partition size that gives the best compression. This search, along with the inherent complexity of the latest encoders, makes it extremely difficult to attain real time performance of 30 fps at low power. The solution to this problem is to perform a low complexity analysis of the Coding Unit and suggest the partition of the CU based on the available CU characteristics, without performing the entire encoding to estimate the cost. The present paper describes a method to do this using the Sum of Absolute Differences (hereafter SAD) and gradient information of the Coding Unit. We show that the presented method results in 3x faster encoding when compared to the brute force algorithm, with a small increase in bitrate (approximately 5% in the worst case) and no change in subjective quality. The complexity bitrate trade-off and the resulting BD-PSNR values of this method are also presented.

Keywords- Quad tree, HEVC, Coding Unit Partition, SAD, Mode decision.

I. INTRODUCTION

Video encoders have always been among the most resource consuming processes in modern consumer electronics devices. The demand for more compression to sustain a number of streaming solutions has led to an increase in the complexity of the encoders. Though advancements in processor technologies directly result in more processing power, this is not sufficient to get real time performance with the latest encoders. Thus, there is a need for a method that reduces complexity with little or no increase in bitrate. All modern encoders have a number of tools to attain the desired compression.
One among them is the selection of the mode, based on spatial and temporal characteristics such as texture and motion vectors respectively, that gives the best compression. The selection is usually done using a brute force search: encoding with all the different possible modes, calculating the cost of each, and selecting the best among them. Due to the huge complexity of the brute force search, this has been the target area of our algorithm.

A generic multi depth quad tree based video codec gives a lot of flexibility in terms of partitioning of the block. It also increases complexity. For example, for a Coding Unit of size 64x64 with minimum possible partition size 4x4, there are 18446744073709551616 ways in which the CU can be partitioned, and we need to perform a minimum of 340 encoder-decoder cycles, on blocks of different sizes, to get the best possible partition information. This is clearly a time consuming exercise.

This paper presents a generic method to partition a Coding Unit into different blocks based on the spatial and temporal characteristics of the unit. The complexity bitrate trade-off that can be achieved is also explained.

The rest of the paper is organized as follows. Section II gives a brief overview of the work that has been done in this field and of the challenges faced when extrapolating those methods to the latest quad tree based video encoders. Section III explains the proposed method in detail. Section IV gives the results of the tests, and Section V gives the conclusion and some future directions.

II. CODING UNIT PARTITION

Fast mode detection is one of the important ways in which encoder complexity can be reduced without compromising on bitrate. The Coding Unit is partitioned into blocks of different sizes to generate a layout that gives the best compression. The discrete cosine transform (DCT) is one of the main tools present in almost all modern encoders.
The DCT exploits the spatial redundancy in a Coding Unit. Areas with uniform texture are the best candidates for the DCT. Thus our problem reduces to finding the uniform areas of a Coding Unit, partitioning them into blocks, and performing the encoding on that layout.

Texture analysis has been used to gauge the continuity of the Coding Unit. A filtering operation using the Sobel operator is used in [1]. The gradient information thus generated is used to select the block size. The problem with this approach is that an area of the Coding Unit with small variations in the surrounding pixels is still marked as non-uniform. The logarithm of the ratio of the energy of the pixels to the energy of a perfectly non-uniform block [2] may also be used to identify uniform areas. This value is compared against a threshold to calculate the block partition layout. These methods may not be used directly in quad tree based video codecs, as the relative variations in the energies of blocks of different sizes may be small, making it difficult to differentiate one block from another.

III. PARTITION SIZE DETERMINATION AND TWO PASS SEARCH

The proposed method employs two different techniques, based on the type of the Coding Unit, i.e. whether it is intra or inter.

978-1-4673-5604-6/12/$31.00 ©2012 IEEE
For the intra Coding Units, prediction within the frame is possible. Thus the spatial characteristics of the Coding Unit decide the way it is partitioned. The gradient information of the Coding Unit is a good representation of its uniformity/continuity. The gradient of the Coding Unit is obtained by Sobel filtering the Coding Unit. The Sobel filter is a derivative operator that provides gradient information in either direction, depending on the filter coefficients.

$$Sobel_x = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}, \qquad Sobel_y = \begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix}$$

$$G_x = P_{i-1,j+1} + 2 P_{i,j+1} + P_{i+1,j+1} - P_{i-1,j-1} - 2 P_{i,j-1} - P_{i+1,j-1}$$

$$G_y = P_{i+1,j-1} + 2 P_{i+1,j} + P_{i+1,j+1} - P_{i-1,j-1} - 2 P_{i-1,j} - P_{i-1,j+1}$$

$$Gradient = G_y + G_x$$

Once the gradient is obtained for the 64x64 Coding Unit, the uniformity of a block is decided based on the standard deviation of its constituent blocks. For example, to decide if a 16x16 area in a Coding Unit is uniform, the standard deviation of the constituent 8x8 blocks is calculated, and if it is less than a threshold, the blocks are marked as 16x16. The same process is applied iteratively for all the block sizes from 4x4 to 64x64. The threshold is different for different block sizes and is calculated after extensive tests on a variety of video streams. This method overcomes the problem with small variations in the surrounding pixels by using the standard deviation values of the constituent blocks to arrive at the decision.

The said method takes care of intra frames, where only spatial prediction, and thus only the gradient analysis, is used. This method may not be used for inter frames, as the blocks in an inter frame can use either spatial or temporal prediction. Thus, a combination of SAD and gradient analysis is appropriate.

First consider the inter blocks in an inter frame. The best matching block from the previous reference frame is subtracted from the current block, and the residual is transformed and encoded. The best partition information cannot be obtained unless all the possible blocks are inter coded and their costs determined. This includes motion vector prediction, motion estimation, transform and entropy coding, which is cumbersome. Our method uses a combination of motion vector prediction, SAD estimation and gradient information to arrive at the best block partition. This method is described below.

The motion vectors of a Coding Unit are closely related to those of its surrounding units. This information can be used to provide an approximate location in the reference frame where the best match of the current Coding Unit can be found. Thus, the first step in estimating the inter mode block partition is to calculate the motion vector prediction from the surrounding Coding Units. The predicted motion vectors thus generated are added to the position of the collocated Coding Unit in the reference frame. The prediction unit generated is subtracted from the current Coding Unit and the SAD is generated. For example, consider a block 'r' as shown in figure 1. The collocated SAD for this block is calculated by obtaining the motion vector prediction from the surrounding blocks and adding it to the coordinates of the collocated block in the previous frame, in this case 's'.

[Figure 1: block s in the (n-1)th frame, block r in the nth frame, and the predicted motion vector.]
Fig. 1 shows an example motion vector prediction for calculating the inter mode SAD.

Standard deviations of the blocks are calculated using the SAD of the constituent blocks in a quad tree fashion. This gives information regarding the continuity, which is the main criterion for deciding the block size. The standard deviation calculated is compared against a threshold to decide the block size for inter mode.
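The collocated SAD step can be illustrated with a minimal Python sketch. The paper does not say how the motion vector prediction is formed from the surrounding units, so the component-wise median used in `predict_mv` is an assumption borrowed from common practice, and the names `predict_mv`, `collocated_sad` and `sad_is_uniform`, the edge clamping, and the `threshold` value are all hypothetical.

```python
from statistics import median, pstdev


def predict_mv(neighbour_mvs):
    """Assumed predictor: component-wise median of the neighbours' (dx, dy)."""
    xs = [mv[0] for mv in neighbour_mvs]
    ys = [mv[1] for mv in neighbour_mvs]
    return int(median(xs)), int(median(ys))


def collocated_sad(cur, ref, top, left, size, mv):
    """SAD between a block of `cur` and the collocated block of `ref`
    displaced by the predicted motion vector (no motion search)."""
    height, width = len(ref), len(ref[0])
    dx, dy = mv
    total = 0
    for i in range(size):
        for j in range(size):
            # Clamp to the frame edge (an assumption; the paper is silent here).
            r = min(max(top + i + dy, 0), height - 1)
            c = min(max(left + j + dx, 0), width - 1)
            total += abs(cur[top + i][left + j] - ref[r][c])
    return total


def sad_is_uniform(cur, ref, top, left, size, mv, threshold):
    """True when the SADs of the four constituent sub-blocks have a
    standard deviation below the (hypothetical) threshold."""
    half = size // 2
    corners = [(top, left), (top, left + half),
               (top + half, left), (top + half, left + half)]
    sads = [collocated_sad(cur, ref, t, l, half, mv) for t, l in corners]
    return pstdev(sads) < threshold
```

In this reading, a block whose four sub-block SADs are similar is treated as continuous and kept at the larger size; the real thresholds would depend on block size and quantization parameter.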
The threshold for inter mode blocks is calculated by extensive tests on different streams, taking into account the quantization parameter used, which is directly related to the quality of the reconstructed reference frame and thus to the SAD.

As the Coding Units in an inter frame may also be coded with intra modes, a check is performed on the gradient of the Coding Unit using the same method that is used in the case of intra frames. The initial partition layout is thus generated.

Once the block partition layout is determined, we need to perform an inter mode and intra mode search on the constituent blocks of the layout to fine tune the partition layout. This method offers the flexibility of controlling the complexity with respect to bitrate. For example, consider that the layout generated after the inter mode and intra mode partition search is as shown in the figure below.

[Figure 2: a 64x64 Coding Unit containing 32x32 blocks x, y and z, 16x16 blocks p, r and s, and 8x8 blocks a, b, c and d.]
Fig. 2 shows an example Coding Unit partition layout with individual blocks color coded.
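The gradient-based check that produces the initial layout might be sketched as follows in Python. This is only an illustration under stated assumptions: the per-block statistic fed to the standard deviation is taken to be the mean absolute gradient (the paper does not specify it), frame-border pixels are given zero gradient, and the function names and the `thresholds` values are hypothetical.

```python
from statistics import pstdev


def sobel_gradient(p):
    """Gx + Gy at every interior pixel of p[i][j]; borders stay 0 (assumption)."""
    h, w = len(p), len(p[0])
    g = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = (p[i - 1][j + 1] + 2 * p[i][j + 1] + p[i + 1][j + 1]
                  - p[i - 1][j - 1] - 2 * p[i][j - 1] - p[i + 1][j - 1])
            gy = (p[i + 1][j - 1] + 2 * p[i + 1][j] + p[i + 1][j + 1]
                  - p[i - 1][j - 1] - 2 * p[i - 1][j] - p[i - 1][j + 1])
            g[i][j] = gx + gy
    return g


def block_score(g, top, left, size):
    """Assumed per-block statistic: mean absolute gradient."""
    total = sum(abs(g[top + i][left + j])
                for i in range(size) for j in range(size))
    return total / (size * size)


def partition(g, top, left, size, thresholds, min_size=4):
    """Return `size` if the block is uniform, else a list of four sub-layouts.
    A block is merged only if all four children were themselves merged and
    the standard deviation of their scores is below thresholds[size]."""
    if size == min_size:
        return size
    half = size // 2
    corners = [(top, left), (top, left + half),
               (top + half, left), (top + half, left + half)]
    children = [partition(g, t, l, half, thresholds, min_size)
                for t, l in corners]
    scores = [block_score(g, t, l, half) for t, l in corners]
    if all(c == half for c in children) and pstdev(scores) < thresholds[size]:
        return size
    return children


# A perfectly flat 16x16 area merges all the way up to a single 16x16 block.
flat = [[100] * 16 for _ in range(16)]
layout = partition(sobel_gradient(flat), 0, 0, 16, {8: 1.0, 16: 1.0})
```

The nested-list result mirrors the quad tree: an integer is a leaf of that size, and a list holds the four quadrants of a split block.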
The brown coloured blocks of the 64x64 Coding Unit are the 32x32 blocks, which are named x, y and z as shown in figure 2. Similarly, the green coloured blocks are the 16x16 blocks of the 2nd 32x32 block, named p, r and s as shown in figure 2. The same is the case with the four blue coloured 8x8 blocks of the 2nd 16x16 block, which are named a, b, c and d.

Once the Coding Unit layout is determined, the search is performed on the block sizes according to the layout. The first block is marked as 32x32 (x in figure 2), so the search is performed for the best intra mode if it is an intra frame, and among intra and inter modes if it is an inter frame. The search is performed by doing one complete encoder-decoder cycle on the block and finding the least cost.

Now we move to the second block, which has a combination of 8x8 blocks (a, b, c and d) and 16x16 blocks (p, r and s). The first sub block of the second block has a depth of 3*, which corresponds to 8x8. So the brute force search is performed on this block for all prediction types and modes, and the best mode is selected. The same process is performed on all 8x8 blocks of the 16x16 block. Finally, these blocks are combined to form a 16x16 block and the cost is calculated. This cost is compared to the sum of the costs of the best modes of the 8x8 blocks. This comparison is done up to two levels, i.e. the best depth given by the analysis module and a depth plus one, i.e. the block size just greater than the one obtained using the partition mode detection described above. This significantly decreases the time consumption because the number of brute force searches is reduced.

Now we move on to the 2nd sub block of the second block. As the depth of this block is 2, corresponding to 16x16 (p in figure 2), the search is performed at this level and the best mode is chosen. The same process takes place for the 3rd (r in figure 2) and 4th (s in figure 2) blocks, which are 16x16 blocks.
Now all these 16x16 blocks are combined to form a 32x32 block and the search is performed at that level. These costs are compared and the best block size is chosen. Finally, the search is performed at the 64x64 level (as 3 blocks of the 64x64 Coding Unit are marked as 32x32 blocks) and the mode with the least cost is chosen. This is compared to the sum of the costs of the best modes of the individual blocks.

The "two pass" search significantly reduces the number of brute force searches performed, because initial information about the partition size is obtained by the intra/inter partition size search described above. For example, the searches performed for the Coding Unit with the layout shown in the figure are:

- Three 32x32 searches for blocks x, y and z.
  o One 64x64 search for the complete Coding Unit; its cost is compared with the sum of the least costs of the constituent sub blocks.
- Four 8x8 searches for sub blocks a, b, c and d.
  o One 16x16 search that includes the four 8x8 blocks. The least cost of the 16x16 block is compared with the sum of the least costs of the constituent 8x8 blocks.
- Three 16x16 searches for p, r and s.
  o One 32x32 search that includes the three 16x16 sub blocks p, r and s and the four 8x8 sub blocks a, b, c and d. The least cost of the 32x32 block is compared with the sum of the least costs of the four 8x8 sub blocks and the three 16x16 sub blocks.

A total of 13 brute force searches are performed, compared to the 340 searches that need to be performed for a full search.

The best possible scenario for this algorithm is when the partition search module gives the correct layout map, i.e. when blocks whose best mode is 32x32 are marked with either depth 1 (32x32) or depth 2 (16x16) by the partition search module, since in both these cases a search is performed for the 32x32 block. The worst case is when a 32x32 block is marked as 4x4 or 8x8 by the partition search module, which results in less compression. The performance degradation is greater when a 4x4 block is marked as 32x32 than when a 32x32 block is flagged as 8x8 or a lower size.
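The count of 13 searches for the example layout can be checked with a short Python sketch. The counting rule below is one reading of the two pass search, not a statement of the authors' implementation: every leaf block is searched once, and a parent block is searched once more whenever at least one of its direct children is a leaf (the "depth plus one" comparison). The nested-list layout encoding and the name `count_searches` are hypothetical, and the position of the split quadrant inside the list is arbitrary.

```python
def count_searches(node):
    """node is an int (leaf block size) or a list of four child nodes."""
    if isinstance(node, int):
        return 1  # one search at the size suggested by the partition module
    total = sum(count_searches(child) for child in node)
    if any(isinstance(child, int) for child in node):
        total += 1  # second pass: one search at the next larger block size
    return total


# 64x64 CU: three 32x32 leaves (x, y, z) and one split 32x32 block holding
# three 16x16 leaves (p, r, s) and four 8x8 leaves (a, b, c, d).
example_layout = [32, [[8, 8, 8, 8], 16, 16, 16], 32, 32]
print(count_searches(example_layout))  # prints 13
```

Under this rule the tally is 10 leaf searches plus one merged search each at the 16x16, 32x32 and 64x64 levels.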
From the analysis we performed on different streams, it is observed that a larger block with uniform texture is most likely to be marked correctly by the partition search module, which results in better compression.

The complexity vs. bitrate trade-off can be achieved by manipulating the block sizes for which the "two pass" search is enabled. For example, the two pass search can be disabled for 32x32, i.e. for blocks that are marked as 32x32 no further search is performed at the 64x64 level, because the partition search module is sensitive enough to mark 32x32 blocks correctly in most cases. The same technique can be applied to all the other block sizes according to the performance required. The results with different settings are discussed in the next section.

IV. RESULTS

The tests are performed on a set of test streams with different motion and texture variations. The results for the test streams with high motion, namely BasketballDrill, BQMall, PartyScene and RaceHorses, are presented here. Two different configuration settings, Random access and Low delay, are used to cover a broad range of use cases. These are similar to the "random access" and "low delay" configurations of the HEVC Test Model (HM) Reference Software. The random access configuration includes support for B frames and a large motion search area, and low delay corresponds to only I and P frames with low coding delay.

The results are shown in Table 1 and Table 2. Tables 1 and 2 correspond to the configuration with hierarchical B coding enabled and the others to the configuration with hierarchical B coding disabled. On average, the random access configuration gives a 1.23 percent increase in bit rate with a 0.19 dB decrease in Y PSNR and a 0.12 dB decrease in U and V PSNR values. The complexity is reduced by 3x on average.

* The quad tree partition of the Coding Unit is performed recursively. So the 64x64 block of the Coding Unit is said to have a depth of 0. Similarly, the four 32x32 blocks of the 64x64 block have a depth of 1, 16x16 blocks have a depth of 2, 8x8 blocks have a depth of 3 and 4x4 blocks have a depth of 4.
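The depth convention in the footnote amounts to halving the block edge at each level, which can be written as a small Python helper. The names `CU_SIZE`, `MAX_DEPTH`, `block_size` and `blocks_at_depth` are introduced here only for illustration.

```python
CU_SIZE = 64    # edge length of the Coding Unit at depth 0
MAX_DEPTH = 4   # depth 4 corresponds to 4x4 blocks


def block_size(depth):
    """Edge length of a block at the given quad tree depth."""
    if not 0 <= depth <= MAX_DEPTH:
        raise ValueError("depth must be between 0 and 4")
    return CU_SIZE >> depth


def blocks_at_depth(depth):
    """Number of blocks of that depth needed to tile one Coding Unit."""
    return 4 ** depth


for d in range(MAX_DEPTH + 1):
    print(d, block_size(d), blocks_at_depth(d))
```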
Sequence         QPISlice  kbps      Y psnr  U psnr  V psnr  EncT
BasketballDrill  22        4375.06   39.58   42.86   43.37   1400.04
BasketballDrill  27        2107.17   36.36   40.42   40.46   1280.17
BasketballDrill  32        1000.23   33.39   37.69   37.36   1183.76
BasketballDrill  37         530.21   31.05   35.34   34.86   1128.12
BQMall           22        4379.97   39.57   43.38   44.80   1623.10
BQMall           27        2116.27   36.89   41.44   42.43   1515.38
BQMall           32        1063.15   33.94   39.30   40.03   1432.74
BQMall           37         588.29   31.37   37.38   37.95   1394.86
PartyScene       22        7706.34   37.44   41.24   42.28   1310.17
PartyScene       27        3478.21   33.89   38.72   39.60   1152.33
PartyScene       32        1600.98   30.68   36.31   37.06   1063.98
PartyScene       37         780.18   28.00   34.38   34.95   1011.64
RaceHorses       22        5478.69   38.37   41.14   42.53   1177.40
RaceHorses       27        2400.65   35.00   38.71   40.27   1039.68
RaceHorses       32        1119.05   31.96   36.38   38.04    930.99
RaceHorses       37         554.48   29.54   34.60   36.11    858.30

Table 1: Full brute force search for Random access configuration.

Sequence         QPISlice  kbps      Y psnr  U psnr  V psnr  EncT
BasketballDrill  22        2286.44   40.48   43.28   43.03   283.75
BasketballDrill  27        1231.64   36.84   40.47   39.85   237.81
BasketballDrill  32         616.44   33.38   37.86   36.82   208.19
BasketballDrill  37         313.62   30.52   35.70   34.28   196.41
BQMall           22        4476.75   38.20   41.11   41.81   309.43
BQMall           27        1942.68   33.81   38.98   39.70   255.93
BQMall           32         585.57   29.72   37.33   37.88   205.52
BQMall           37         179.15   26.37   36.24   36.32   175.70
PartyScene       22        2887.81   37.77   39.97   40.49   258.93
PartyScene       27        1185.27   33.47   37.04   37.53   201.25
PartyScene       32         484.65   29.70   34.60   35.07   165.48
PartyScene       37         205.47   26.79   32.68   33.17   144.09
RaceHorses       22        1805.35   39.66   41.10   42.09   170.76
RaceHorses       27         910.45   35.69   38.21   39.25   144.92
RaceHorses       32         433.25   32.02   35.54   36.54   124.99
RaceHorses       37         208.43   29.27   33.36   34.24   113.72

Table 3: Full brute force search for Low delay configuration.
Sequence         QPISlice  kbps      Y psnr  U psnr  V psnr  EncT
BasketballDrill  22        4496.98   39.36   42.64   43.12   811.84
BasketballDrill  27        2124.82   36.17   40.22   40.23   700.42
BasketballDrill  32         998.37   33.25   37.54   37.18   593.70
BasketballDrill  37         525.37   30.93   35.23   34.75   526.63
BQMall           22        4601.56   39.34   43.20   44.59   944.46
BQMall           27        2182.30   36.60   41.30   42.29   798.63
BQMall           32        1071.64   33.64   39.19   39.91   723.78
BQMall           37         584.62   31.10   37.30   37.85   656.57
PartyScene       22        8105.59   37.26   41.03   42.04   875.11
PartyScene       27        3603.63   33.60   38.51   39.38   673.78
PartyScene       32        1623.21   30.34   36.17   36.93   538.19
PartyScene       37         767.30   27.67   34.28   34.90   456.18
RaceHorses       22        4496.98   39.36   42.64   43.12   811.84
RaceHorses       27        2124.82   36.17   40.22   40.23   700.42
RaceHorses       32         998.37   33.25   37.54   37.18   593.70
RaceHorses       37         525.37   30.93   35.23   34.75   526.63

Table 2: Reduced search (discussed algorithm) for Random access configuration (hierarchical B-frames).

Sequence         QPISlice  kbps      Y psnr  U psnr  V psnr  EncT
BasketballDrill  22        2531.40   40.22   42.93   42.71   138.10
BasketballDrill  27        1329.12   36.62   40.15   39.56   120.27
BasketballDrill  32         659.59   33.08   37.54   36.45    95.66
BasketballDrill  37         330.28   30.23   35.37   33.93    79.72
BQMall           22        4816.47   38.07   40.93   41.64   187.98
BQMall           27        2041.54   33.58   38.90   39.57   161.27
BQMall           32         581.22   29.36   37.33   37.81   108.10
BQMall           37         177.45   25.96   36.28   36.27    78.82
PartyScene       22        3307.51   37.53   39.64   40.12   142.00
PartyScene       27        1301.96   33.09   36.86   37.27   110.12
PartyScene       32         526.17   29.17   34.40   34.83    77.93
PartyScene       37         209.37   26.35   32.50   32.92    58.60
RaceHorses       22        2067.34   39.56   40.97   41.88   101.60
RaceHorses       27        1018.12   35.53   37.98   39.05    86.41
RaceHorses       32         476.35   31.81   35.30   36.31    67.75
RaceHorses       37         221.01   29.02   33.08   33.97    54.78

Table 4: Reduced search (discussed algorithm) for Low delay configuration.

On the other hand, the low delay configuration results in a 5.88 percent increase in bit rate with a 0.28 dB decrease in the PSNR value for Y and 0.12 and 0.13 dB decreases in the PSNR values for U and V respectively. These results are presented in Tables 3 and 4.
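As a worked example of reading the tables, the Python sketch below compares one matching pair of rows (BasketballDrill, Random access, QPISlice 27) from the full search and reduced search tables. It only shows raw per-row arithmetic; the averages quoted in the text may have been computed differently (for instance across all sequences or as BD measures), so these single-row figures are not expected to reproduce them. The function names are hypothetical.

```python
def bitrate_increase_percent(full_kbps, reduced_kbps):
    """Relative bitrate change of the reduced search, in percent."""
    return 100.0 * (reduced_kbps - full_kbps) / full_kbps


def speedup(full_enct, reduced_enct):
    """Ratio of encoding times (full search over reduced search)."""
    return full_enct / reduced_enct


def psnr_drop(full_psnr, reduced_psnr):
    """Decrease in PSNR (dB) caused by the reduced search."""
    return full_psnr - reduced_psnr


# BasketballDrill, Random access, QPISlice 27: (kbps, Y psnr, EncT)
full_row = (2107.17, 36.36, 1280.17)      # full brute force search
reduced_row = (2124.82, 36.17, 700.42)    # reduced search

print(round(bitrate_increase_percent(full_row[0], reduced_row[0]), 2))
print(round(psnr_drop(full_row[1], reduced_row[1]), 2))
print(round(speedup(full_row[2], reduced_row[2]), 2))
```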
[Figure 3 plot: "Bitrate vs PSNR for Random Access Configuration"; PSNR against 10*log(bitrate); curves "Present" and "Original".]
Figure 3: The Bitrate vs. PSNR graph for the BasketballDrill stream with Random access configuration.

The graph clearly illustrates that the RD curve of the present algorithm closely follows the original.

[Figure 4 plot: "Bitrate vs PSNR for Low Delay"; PSNR against 10*log(bitrate); curves "Present" and "Original".]
Figure 4: The Bitrate vs. PSNR graph for the BasketballDrill stream with Low delay configuration.

Please note that the logarithm of the bitrate is taken for both graphs, as the variation in bitrate with change in QP is non-linear. The QP values 16, 20, 24 and 28 are used to collect the data.

V. CONCLUSION

The test results clearly show that there is a significant reduction in complexity for a small trade-off in bitrate and PSNR values. There is a 3x decrease in complexity on average with less than 5% increase in bitrate on average. The complexity can be further reduced to a fraction of the present value by disabling the two pass search for 32x32 and 64x64 block sizes. The average decrease in PSNR values in this case is less than 1 dB. In the future this method can be extended to find the global best possible partition by taking into account the overall layout of the Coding Units in a frame or in multiple frames, thus reducing the bitrate.

REFERENCES

[1] Efficient Block-Size Selection Algorithm for Inter Frame Coding in H.264/MPEG-4 AVC, Andy C. Yu
[2] Advanced Block Size Selection Algorithm for Inter Frame Coding in H.264/MPEG-4 AVC, Andy C. Yu and Graham R. Martin
[3] Efficient Intra- and Inter-mode Selection Algorithms for H.264/AVC, Andy C. Yu, Ngan King Ngi, Graham R. Martin
[4] Low Complexity H.264 Video Encoding, Paula Carrillo, Hari Kalva, and Tao Pin. Dept. of Computer Science and Technology, Tsinghua University, Beijing, China; Dept. of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, USA
[5] Low Complexity Skip Prediction for H.264 through Lagrangian Cost Estimation, C. S. Kannangara, I. E. G. Richardson, M. Bystrom, J. Solera, Y. Zhao, A. MacLennan and R. Cooney