274 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 3, MARCH 2011

Video Super-Resolution Algorithm Using Bi-Directional Overlapped Block Motion Compensation and On-the-Fly Dictionary Training

Byung Cheol Song, Senior Member, IEEE, Shin-Cheol Jeong, and Yanglim Choi

Abstract—This paper presents a video super-resolution algorithm that interpolates an arbitrary frame in a low-resolution video sequence from sparsely existing high-resolution key-frames. First, hierarchical block-based motion estimation is performed between an input frame and the low-resolution key-frames. If the motion-compensated error is small, the input low-resolution patch is temporally super-resolved via bi-directional overlapped block motion compensation. Otherwise, the input patch is spatially super-resolved using a dictionary already learned from a low-resolution key-frame and its corresponding high-resolution key-frame. Finally, possible blocking artifacts between temporally super-resolved patches and spatially super-resolved patches are concealed using a dedicated de-blocking filter. Experimental results show that the proposed algorithm provides significantly better subjective visual quality, as well as higher peak signal-to-noise ratio (PSNR), than previous interpolation algorithms.

Index Terms—Bi-directional overlapped block motion compensation, dictionary, hierarchical motion estimation, key-frames, super-resolution, training.

I. Introduction

FOR THE LAST few decades, many image interpolation algorithms have been developed to display high-quality scaled images on cutting-edge digital consumer devices such as high-definition televisions (HDTV), digital still cameras (DSC), and digital camcorders. Traditional interpolation methods such as bilinear interpolation, bi-cubic interpolation, and cubic convolution usually suffer from several types of visual degradation, e.g., jagging and stair-case artifacts.
In order to overcome the above-mentioned problems, Li and Orchard [1] proposed the new edge-directed interpolation (NEDI), which makes use of the geometric duality between the covariance in the low-resolution (LR) and high-resolution (HR) images. Wang and Ward [2] proposed an orientation-adaptive bilinear interpolation algorithm, and Kwak et al. [3] presented an edge-directional cubic convolution scaler. Although the latter two methods provide better peak signal-to-noise ratio (PSNR) than the conventional methods for clear edges, their improvement in terms of subjective quality is still limited because they are structurally weak against textures and non-linear edges.

[Footnote: Manuscript received March 29, 2010; revised June 22, 2010; accepted August 20, 2010. Date of publication October 14, 2010; date of current version March 23, 2011. This research was supported by the DMC R&D Center, Samsung Electronics Co., Ltd.; by the Ministry of Knowledge Economy and the Korea Institute for Advancement of Technology through the Human Resource Training Project for Strategic Technology; and by the National Research Foundation of Korea Grant funded by the Korean Government, under Grant 2009-0071385. This paper was recommended by Associate Editor W. Zhu. B. C. Song and S.-C. Jeong are with the School of Electronic Engineering, Inha University, Incheon 402-751, Korea (e-mail: bcsong@inha.ac.kr; shinchul61@inha.edu). Y. Choi is with the DMC Research and Development Center, Samsung Electronics Company, Ltd., Suwon 443-742, Korea (e-mail: yanglimc@samsung.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2010.2087454. 1051-8215/$26.00 © 2010 IEEE]

Therefore, Zhao et al. [4] insisted that a proper solution for better visual quality is to combine well-known linear interpolation algorithms with sharpening techniques such as luminance transient improvement (LTI) and peaking. However, such an approach cannot be an ultimate solution for generating real high-frequency (HF) details from LR images. According to [4], although post-processing such as LTI provides seemingly good visual quality owing to its sharpening effect, it tends to degrade PSNR and cause a noise-boosting phenomenon. Moreover, since most digital imaging applications such as HDTV and DSC already possess sharpening or deblurring tools used as post-processing for image enhancement, the preferred goal of image interpolation is to produce up-scaled images close to the original HR images by reconstructing as many reliable HF components as possible.

One promising approach is to obtain an HR image from multiple LR images. Such a resolution enhancement technology, called super-resolution (SR) image reconstruction in [5]–[7], has recently been one of the most actively researched areas. A typical SR method reconstructs an image from multiple LR images, and registration is a crucial step to its success. Accurate sub-pixel motion estimation between adjacent LR images must be performed for successful registration, because general registration methods rely on robust motion models representing multiple object motions, occlusions, transparency, and so on. However, registration not only requires considerable computation; the performance of some registration algorithms is also not guaranteed in environments where the motion between two LR images is very complex. For example, a simple motion model that represents only translation and rotation may not properly describe the real motion in all regions of the sequence [8].
As an approach to avoid the above-mentioned problem, many example-based or learning-based SR algorithms have

been developed recently. In the off-line learning stage, most example-based SR algorithms [9]–[13] construct a dictionary composed of a large number of LR and HR patch pairs. In the on-the-fly inference stage, an input LR image is split into either overlapping or non-overlapping patches, and either the single best-matched patch or a set of best-matched LR patches is selected from the dictionary for each input LR patch. The corresponding HR patches are then used to reconstruct an output HR image. However, the existing algorithms are computationally intensive in finding the best-matching LR–HR patch pair in a huge dictionary. Furthermore, best-matched but incorrect patches often degrade the reconstruction results [9]–[12]. Li et al. [13] partially mitigated these problems by classifying both LR and HR patches with a vector quantization technique in the learning stage. Still, Li's algorithm did not provide acceptable visual quality because its limited number of categories caused blur artifacts. Unfortunately, it is very hard to reconstruct a high-quality HR image from single or multiple LR images even with computationally heavy SR methods [12], [13].

Brandi et al. presented an interesting SR approach [14] for reversed-complexity video coding schemes such as distributed video coding. They defined so-called key-frames (KFs) that exist sparsely in a video sequence and have HR resolution; the remaining frames in the video sequence, i.e., non-key frames (NKFs), have LR resolution. Brandi et al. took advantage of the fact that a few KFs (encoded at HR) may provide enough HF information to up-scale the NKFs (encoded at LR). However, Brandi's method rarely found true motion information because it relied on conventional full-search motion estimation.
Therefore, this paper presents a hybrid SR algorithm in which each LR patch is adaptively chosen between a temporally super-resolved patch and a spatially super-resolved patch, using the adjacent HR KFs. All the LR frames are initially up-scaled so that motion compensation can be performed at HR resolution. For each input LR patch, hierarchical motion estimation is applied bi-directionally to the forward and backward LR KFs to obtain motion vectors (MVs) as close to the true motion as possible. If the motion-compensated error is smaller than a threshold, the input LR patch is super-resolved using bi-directional overlapped block motion compensation (OBMC). Otherwise, the input LR patch is spatially super-resolved using an on-the-fly trained dictionary learned from the neighboring LR and HR KFs. Finally, since blocking artifacts often occur at boundaries between temporally super-resolved patches and spatially super-resolved patches, a simple de-blocking filter is applied to the patch boundaries. Simulation results show that the proposed algorithm outperforms existing interpolation algorithms in terms of visual quality as well as PSNR. We also show that, with an efficient compression scheme such as H.264, the proposed algorithm can be implemented with a reasonable storage overhead.

The rest of this paper is organized as follows. Sections II and III present the motion-compensated SR algorithm and the learning-based SR, respectively. We describe the hybrid SR algorithm in Section IV. Section V provides the experimental results. Finally, we conclude in Section VI.

Fig. 1. LR sequence and HR KFs, assuming the 0th and Nth frames are KFs. Forward motion estimation is performed between a target LR frame LR_n and LR KF LR_0, and backward motion estimation is performed between the target LR frame and LR KF LR_N.

II. Motion-Compensated Super-Resolution
A. Basic Concept

This section presents a motion-compensated SR (MSR) in which a target frame in the LR video sequence is interpolated using the forward and backward HR KFs closest to it. Fig. 1 illustrates the HR KFs as well as the LR video sequence to be interpolated. LR_n denotes the nth LR frame, i.e., the target LR frame. Note that if a specific frame is a KF, its LR KF always coexists with the corresponding HR KF as a pair. In the example of Fig. 1, the 0th and Nth frames are KFs.

The KF interval N can be constant or variable. Since the HR KFs are a form of side information for interpolating the LR video sequence, N must be determined so that this side information is minimized, subject to a few constraints on global motion, shot changes, memory cost, and so on. First, assuming that significant global motion rarely occurs within a span of only 1 or 2 s in a video sequence shot by a typical camcorder user, we can set the lower bound of N to 1 or 2 s. Second, N can be flexibly determined depending on shot boundaries. Finally, we must minimize the burden of the HR KFs on storage space: even though the capacity of storage media such as HDDs and flash memory has increased dramatically in recent years, the maximum recording time is still the most important specification of camera products. This paper therefore sets a reasonable upper bound of 10% on the storage overhead of the HR KFs. As a result, given the compression ratios of the HR KFs and the LR sequence, N can be determined automatically from this storage overhead.

Fig. 2 describes the proposed MSR used to super-resolve LR_n. First, the two LR KFs, LR_0 and LR_N, are up-scaled to HR resolution, and LR_n is up-scaled to the same resolution. For this initial up-scaling we can use a linear interpolation method such as bilinear interpolation, bi-cubic interpolation, or cubic convolution; this paper employs cubic convolution with α of 0.5.
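The initial up-scaling step can be sketched in 1-D with Keys' cubic convolution kernel. This is only a sketch: the kernel parameter is conventionally negative (a = −0.5 here, while the transcription reads "α of 0.5"), and the extension to 2-D by separable row/column passes is an assumption, not spelled out in the text.

```python
import numpy as np

def cubic_kernel(x, a=-0.5):
    """Keys' cubic convolution kernel with free parameter a."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def upscale_1d_2x(signal, a=-0.5):
    """2x up-scaling of a 1-D signal: original samples are kept, and each
    new half-integer sample is a weighted sum of its 4 nearest neighbors."""
    w = np.array([cubic_kernel(d, a) for d in (1.5, 0.5, 0.5, 1.5)])
    padded = np.pad(np.asarray(signal, dtype=float), 2, mode='edge')
    out = np.empty(2 * len(signal))
    out[0::2] = signal                          # original samples survive
    for i in range(len(signal)):
        out[2 * i + 1] = w @ padded[i + 1:i + 5]  # new in-between sample
    return out
```

For a = −0.5 the half-sample weights come out to [−1/16, 9/16, 9/16, −1/16], which reproduce linear ramps exactly in the interior.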
Let HR_0^U, HR_n^U, and HR_N^U be the three up-scaled frames for LR_0, LR_n, and LR_N, respectively. In summary, every NKF has three versions: original LR, up-scaled HR, and super-resolved HR, while every KF has three different versions: original HR, down-scaled LR, and up-scaled HR. Note that the up-scaled frames have the same resolution as the HR KFs, but they still have low visual quality due to lost HF information. After the initial up-scaling, bi-directional motion estimation (ME) is performed: forward ME between HR_n^U and HR_0^U,

Fig. 2. Overview of the motion-compensated SR algorithm.

and backward ME between HR_n^U and HR_N^U. We employ a hierarchical ME to find forward and backward MVs as close to the true MVs as possible, because such MVs mitigate the blocking artifact after motion compensation. The forward and backward MVs are determined on an M × M patch basis. Finally, a super-resolved HR frame is obtained by applying bi-directional OBMC (BOBMC) to the HR KFs based on the forward and backward MVs, which further improves visual quality.

The MSR is based on HR KFs, similarly to [14]. The difference is that the MSR super-resolves the current LR patch by replacing it with the best HR patch from the HR KFs, while [14] adds an HF patch computed from the HR KF to the current LR patch. Additionally, the MSR uses hierarchical, bi-directional ME to find true MVs and employs OBMC to reduce visually annoying blocking artifacts.

B. Hierarchical Motion Estimation

Fig. 3 exemplifies the forward ME between the up-scaled frames of the current LR frame and an LR KF, i.e., HR_n^U and HR_0^U. First, the MV for an overlapping M × M matching block is searched using rate-constrained ME, which is generally more useful than brute-force ME for deriving true MVs [15]. The sum of absolute differences (SAD) is generally preferred as the distortion measure for an MV v = (v_x, v_y):

    SAD(v) = Σ_{i,j} |HR_n^U(o_x + i, o_y + j) − HR_0^U(o_x + i + v_x, o_y + j + v_y)|    (1)

where (o_x, o_y) denotes the location of the matching block in the up-scaled frame and (i, j) is a pixel position within the matching block. For simplicity, o_x and o_y are omitted in the following block-level representations.
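The SAD of (1) and a plain full search over the window can be sketched as follows. This is a brute-force baseline only; the rate-constrained criterion discussed next adds a λ·R(v) term on top, and all array names here are hypothetical.

```python
import numpy as np

def sad(cur, ref, ox, oy, vx, vy, M=16):
    """Eq. (1): SAD between the M x M block of HR_n^U at (ox, oy) and the
    block of HR_0^U displaced by the candidate MV v = (vx, vy)."""
    a = cur[oy:oy + M, ox:ox + M].astype(np.int64)
    b = ref[oy + vy:oy + vy + M, ox + vx:ox + vx + M].astype(np.int64)
    return int(np.abs(a - b).sum())

def full_search(cur, ref, ox, oy, M=16, w=8):
    """Brute-force search over the [-w, w]^2 window, minimizing SAD only
    (no rate term yet); candidates falling outside the frame are skipped."""
    best = None
    for vy in range(-w, w + 1):
        for vx in range(-w, w + 1):
            y, x = oy + vy, ox + vx
            if y < 0 or x < 0 or y + M > ref.shape[0] or x + M > ref.shape[1]:
                continue
            c = sad(cur, ref, ox, oy, vx, vy, M)
            if best is None or c < best[0]:
                best = (c, (vx, vy))
    return best[1]
```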
The conventional rate-constrained ME algorithms find the best v for each matching block by minimizing the rate as well as the distortion:

    min_{v ∈ Ω} SAD(v) + λ·R(v)    (2)

where R(v) and λ denote the MV bit-rate of the matching block and the Lagrange multiplier, respectively. Lagrangian minimization is often used to solve this kind of optimization problem. Here, Ω is the search window of size [−w, w] × [−w, w], where w is the upper bound of the search range. The MV bit-rate is computed by treating each block's MV as if it were coded independently, although MVs are coded differentially using a Huffman code in an MPEG video encoder. In order to maximize ME performance while concurrently reducing the computational burden, we adopt the rate-constrained fast full-search algorithm presented in [15]. Finally, the selected MV for the M × M block is allocated to its central L × L block, as shown in Fig. 3(b). A local motion search with a smaller search window could then be performed around the selected MV for each L × L block, but this local search is skipped in this paper. In this way, the forward MVs of all the L × L blocks are obtained; the backward MVs are acquired similarly.

Fig. 3. (a) Overlapped matching block of size M × M. (b) Forward motion estimation between HR_n^U and HR_0^U. The selected MV for the M × M block is allocated to its central L × L block.

C. Bi-Directional OBMC

Without loss of generality, we assume that MVs between up-scaled LR frames are statistically very similar to those between the corresponding HR frames [16]. Therefore, we replace the unknown MVs of the HR frames with the MVs obtained from the up-scaled LR frames. As mentioned above, direct motion compensation often causes blocking artifacts. In order to reduce them, we employ OBMC [17], [18], which was adopted in the MPEG-4 video coding standard [19]. In this paper, we propose BOBMC based on bi-directional MVs. The BOBMC is performed on a 4 × 4 block basis, i.e., L = 4.
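To make the overlapped blending concrete, here is a 4 × 4 OBMC-style weighted average over the five motion-compensated predictions (current, top, bottom, left, right). The weight matrices below are illustrative placeholders chosen so the per-pixel weights sum to 1; they are NOT the paper's Fig. 4 matrices, which are derived from the MPEG-4 8 × 8 OBMC weights.

```python
import numpy as np

def bobmc_blend(p_c, p_t, p_b, p_l, p_r):
    """Blend five motion-compensated 4x4 predictions with
    position-dependent weights: the center weight is constant, and each
    neighbor's weight grows toward its own side of the block."""
    w_c = np.full((4, 4), 4.0)
    ramp = np.array([3.0, 2.0, 1.0, 0.0])
    w_t = np.tile(ramp[:, None], (1, 4))   # strongest on the top row
    w_b = w_t[::-1, :]                     # strongest on the bottom row
    w_l = np.tile(ramp[None, :], (4, 1))   # strongest on the left column
    w_r = w_l[:, ::-1]                     # strongest on the right column
    total = w_c + w_t + w_b + w_l + w_r    # 10 at every position here
    return (w_c * p_c + w_t * p_t + w_b * p_b
            + w_l * p_l + w_r * p_r) / total
```

Because the weights form a convex combination at every pixel, identical predictions pass through unchanged, and the output always stays within the range of the five inputs.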
The pixel value at (i, j) of a given 4 × 4 block is super-resolved by the following BOBMC:

    U_H^MSR(i, j) = W_C(i, j)·p_C(i, j) + W_T(i, j)·p_T(i, j) + W_B(i, j)·p_B(i, j) + W_L(i, j)·p_L(i, j) + W_R(i, j)·p_R(i, j),  0 ≤ i, j ≤ 3    (3)

where the weight matrices for the BOBMC are defined in Fig. 4. We derived the 4 × 4 weight matrices from the 8 × 8 weight matrices of MPEG-4 OBMC. In (3), p_C(i, j), p_T(i, j), p_B(i, j), p_L(i, j), and p_R(i, j) are the pixel values at (i, j) in the motion-compensated blocks corresponding to the current MV,

and the MVs of the top neighbor block, the bottom neighbor block, the left neighbor block, and the right neighbor block, respectively. Note that each MV may be a forward MV or a backward MV: among the two directional MVs of each block, the one generating the smaller SAD is chosen as the best MV for that block.

Fig. 4. Weight matrices for 4 × 4 block-based BOBMC.

Fig. 5. The lth LR patch U_L^l located at (x, y) and its corresponding HR patch U_H^l. Note that U_H^l is the center region of the HR block that is located at (2x, 2y) and corresponds to U_L^l.

III. Learning-Based Super-Resolution

The drawback of the MSR is that it often cannot provide acceptable visual quality due to non-translational motion, occlusion, inaccurate motion estimation, and the limited motion search range. When temporal motion compensation does not work well, we instead employ a learning-based SR (LSR) to avoid this degradation of visual quality. The proposed LSR algorithm consists of an on-the-fly learning stage, which constructs a trained dictionary using adjacent LR and HR KFs, and an inference stage, which generates a spatially super-resolved patch using the trained dictionary.

A. On-the-Fly Learning Stage

This learning stage is performed on the LR and HR KFs whenever a KF arrives. First, all possible LR and HR patch pairs of size L × L are extracted from LR_0 and LR_N and from HR_0 and HR_N, as in Fig. 5. Let U_L^l and U_H^l denote the lth LR and HR patches among all the patch pairs, respectively. Note that U_H^l is extracted from the center region of the HR block that is located at (2x, 2y) of an HR KF and corresponds to U_L^l at (x, y) of an LR KF. Each LR patch is extracted with proper overlapping of adjacent LR patches; in this paper, L − 1 pixels are overlapped between neighboring LR patches in both directions.
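This patch-pair extraction (L = 4, stride 1 for the L − 1 overlap, 2× scaling) can be sketched as below. The exact centering offset inside the 2L × 2L HR block is an assumption; the paper only says the HR patch is the block's center region.

```python
import numpy as np

def extract_patch_pairs(lr_kf, hr_kf, L=4):
    """Collect all L x L LR patches (stride 1, i.e., L-1 pixel overlap)
    and, for each, the L x L HR patch taken from the center of the
    2L x 2L HR block located at (2x, 2y) -- a sketch of Fig. 5."""
    pairs = []
    H, W = lr_kf.shape
    off = L // 2   # center-region offset inside the 2L x 2L HR block
    for y in range(H - L + 1):
        for x in range(W - L + 1):
            if 2 * y + off + L > hr_kf.shape[0] or 2 * x + off + L > hr_kf.shape[1]:
                continue   # HR center region would fall outside the frame
            u_l = lr_kf[y:y + L, x:x + L]
            u_h = hr_kf[2 * y + off:2 * y + off + L,
                        2 * x + off:2 * x + off + L]
            pairs.append((u_l, u_h))
    return pairs
```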
Simultaneously, the corresponding HR patches are extracted from the HR KFs. In general, SR performance depends highly on the matching accuracy between the input LR patch and the candidate LR patches in the dictionary. It is known that Laplacian patches provide better matching accuracy than pixel-domain patches [12]. Thus, we produce the Laplacian patch of each LR patch by applying a 3 × 3 Laplacian operator to every pixel in the LR patch. The Laplacian patches are then normalized for more reliable matching. Let V_L^l denote the normalized Laplacian of U_L^l.

Next, we cluster similar LR and HR patch pairs by applying K-means clustering based on V_L to all the patch pairs. As a result, K cluster centers are obtained, and each cluster is indexed by its center. Fig. 6 shows the clustering results. Let U_L^{k,m} and U_H^{k,m} be the mth LR and HR patches in the kth cluster, respectively. Note that U_H^{k,m} can be derived from U_L^{k,m} by the following equation:

    U_H^{k,m}(i, j) = Σ_{x′=0}^{L−1} Σ_{y′=0}^{L−1} w_{ij}^{k,m}(x′, y′) U_L^{k,m}(x′, y′)    (4)

where (x′, y′) and (i, j) denote the pixel positions in the LR and HR patches, respectively. Our goal is to derive a common weight set W^k, i.e., {w_{ij}^k(x′, y′) | 0 ≤ i, j, x′, y′ ≤ L − 1}, for all LR and HR patches in the kth cluster such that the synthesis error of (4) is minimized. In order to seek such an optimal W^k, we employ the popular LMS algorithm [20]. Finally, we obtain the optimal dictionary {(V_L^k, W^k) | 1 ≤ k ≤ K}. Note that the number of candidates K in the optimized dictionary is significantly smaller than the number of entire patch pairs extracted from the LR and HR KFs.

Fig. 6. Clustering results. Here, the number of clusters is K. Each cluster possesses a single cluster center and multiple LR and HR patch pairs.

The LSR is a typical dictionary-based SR like [13]. The difference is that the weight set of the LSR is derived from LR and HR patches, while the weight set of [13] is produced from LR and HR Laplacian patches.
So, the weight set of the LSR is generally more accurate than that of [13] because the dynamic range of the former is much less than that of the latter.
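A compact sketch of this learning stage follows: normalized-Laplacian features, NumPy K-means, and one linear weight set per cluster. Two substitutions are mine, not the paper's: ordinary least squares replaces the iterative LMS of [20], and the 3 × 3 Laplacian stencil and edge padding are assumptions.

```python
import numpy as np

LAP = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]])   # 3x3 Laplacian operator

def laplacian_feature(patch):
    """Normalized Laplacian patch of an L x L LR patch (edge-padded)."""
    L = patch.shape[0]
    p = np.pad(patch.astype(float), 1, mode='edge')
    f = sum(LAP[i, j] * p[i:i + L, j:j + L] for i in range(3) for j in range(3))
    f = f.ravel()
    n = np.linalg.norm(f)
    return f / n if n > 0 else f

def train_dictionary(pairs, K=8, iters=10, seed=0):
    """K-means on Laplacian features, then a linear weight set W_k per
    cluster mapping LR patches to HR patches (least squares stands in
    for LMS). Returns (cluster centers, weight list, labels)."""
    V = np.stack([laplacian_feature(ul) for ul, _ in pairs])
    X = np.stack([ul.ravel().astype(float) for ul, _ in pairs])
    Y = np.stack([uh.ravel().astype(float) for _, uh in pairs])
    rng = np.random.default_rng(seed)
    centers = V[rng.choice(len(V), size=K, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((V[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = V[labels == k].mean(axis=0)
    weights = []
    for k in range(K):
        idx = labels == k
        if idx.any():
            Wk, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        else:
            Wk = np.zeros((X.shape[1], Y.shape[1]))
        weights.append(Wk)
    return centers, weights, labels
```

Inference then amounts to matching an input patch's feature against the K centers and applying the winning cluster's weight matrix, mirroring (5).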

Fig. 7. Result images synthesized by (a) MSR and (b) LSR. (c) Original frame, i.e., the 15th frame of the Foreman sequence.

Fig. 8. Block diagram of the hybrid SR algorithm.

B. Inference Stage

The normalized Laplacian V_{L,in} of an input LR patch U_{L,in} is first derived. The best-matched patch for V_{L,in} is then found by exhaustive search over the dictionary. In this paper, SAD is employed as the matching distortion measure: the SAD between V_{L,in} and every candidate in the dictionary is computed, and the Laplacian patch candidate with the minimum SAD is chosen as the best match. Let W^{k_best} be the weight set corresponding to the best-matched patch. From W^{k_best} and (4), we can produce a spatially super-resolved HR patch from U_{L,in}:

    U_H^LSR(i, j) = Σ_{x′=0}^{L−1} Σ_{y′=0}^{L−1} w_{ij}^{k_best}(x′, y′) U_{L,in}(x′, y′).    (5)

In other words, since the HR patch is synthesized by convolving the selected weight set with the input LR patch, the LSR is a kind of spatial filter.

IV. Hybrid Super-Resolution

In this section, we propose a hybrid SR in which the MSR is applied to well-motion-compensated regions and the LSR is applied to poorly-motion-compensated regions. Fig. 7 shows the images synthesized by the MSR and the LSR for a target frame, i.e., the 15th frame of the Foreman sequence, when the KF interval N is 30. The 0th and 30th frames of the sequence are chosen as KFs. In this experiment, K for the LSR is set to 512, and M × M, L × L, and w for the MSR are set to 16 × 16, 4 × 4, and 64, respectively. The MSR image provides good quality in the background area, which is well motion-compensated, but poor quality in the head area, which can seldom be well motion-compensated, as shown in Fig. 7(a). On the contrary, the LSR image shows relatively better quality in the head area than the MSR image, as shown in Fig. 7(b).
So, if we apply the MSR to the background and the LSR to the head area, we can achieve the best visual quality. Thus, we propose to combine the MSR and the LSR as shown in Fig. 8. For each LR patch, the mode decision module chooses the better HR patch between the two super-resolved patches synthesized by the LSR and the MSR. The details of the MSR and the LSR have already been described in the previous sections; this section describes the mode decision part and the post-processing part that reduces the blocking artifact at patch boundaries.

Fig. 9. (a) Mode decision flow. (b) Determination of the threshold T_m.

A. Mode Decision

First, we compute the MC error of each LR patch, defined as

    min_{v ∈ {v_f, v_b}} SAD(v)    (6)

where v_f and v_b denote the forward MV and backward MV, respectively. The MC error is then compared with an adaptive threshold T_m, as in Fig. 9(a): if the MC error is larger than T_m, the LSR is applied; otherwise, the MSR is applied. In general, the MC error tends to be proportional to the variance of the LR block, denoted σ. So, T_m is adaptively determined to be proportional to σ, with proper lower and upper bounds T_L and T_U, as in (7) [see Fig. 9(b)]:

    T_m = min{α·σ + T_L, T_U}.    (7)

T_L and T_U in (7) are determined empirically using dozens of training images of CIF, 720p, and 1080p formats, which are different from the test video sequences of Section V. In more detail, we search for the common threshold range [T_L, T_U] for which the proposed algorithm provides the best PSNR on the training images. From this experiment, T_L and T_U are set to 2 and 10, respectively.
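The decision rule of (6)–(7) is tiny in code. One sketch, with two assumptions the paper leaves open: σ is taken as the patch's standard deviation (the text writes "variance" but uses the symbol σ, and Fig. 10's σ_U ≈ 16 fits a standard deviation), and the MC error is assumed to be on a scale comparable to T_m ∈ [2, 10].

```python
import numpy as np

def mode_decision(mc_error, lr_patch, alpha=0.5, t_l=2.0, t_u=10.0):
    """Eqs. (6)-(7): compare the best bi-directional MC error against the
    adaptive threshold T_m = min(alpha * sigma + T_L, T_U).
    Returns 'MSR' (temporal path) or 'LSR' (dictionary path)."""
    sigma = float(np.std(lr_patch))          # assumed std, see lead-in
    t_m = min(alpha * sigma + t_l, t_u)
    return 'MSR' if mc_error <= t_m else 'LSR'
```

Note the internal consistency of the constants: with σ_U = 16, α·σ_U + T_L = 0.5·16 + 2 = 10 = T_U, which is exactly how α = 0.5 is chosen from Fig. 10.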

Fig. 10. Histogram of patch variance σ for various video sequences.

Fig. 11. Boundary (gray) pixels for post-processing. In this example, the left and right patches are synthesized by the LSR and the MSR, respectively.

We also determine α from the histogram of σ for the training images, such that the boundary σ between the lower 80% and the upper 20%, i.e., σ_U, corresponds to T_U (see Fig. 10). As Fig. 10 shows, σ_U is around 16; hence α is empirically set to 0.5 for a [T_L, T_U] of [2, 10]. Table I compares the adaptive thresholding of (7) with constant thresholding in terms of PSNR; the experimental conditions are essentially those of Section V. From Table I, we observe that the proposed adaptive thresholding tracks the best performance of the constant thresholding well.

B. Post-Processing

At the boundaries between temporally super-resolved patches and spatially super-resolved patches, blocking artifacts can be observed because those patches are derived from different frames: the temporally super-resolved patches are generated from HR KFs, while the spatially super-resolved patches are synthesized by weighted-averaging their target LR patches. In order to reduce the blocking artifact, we apply a simple smoothing filter to the boundary pixels. For example, each gray boundary pixel in Fig. 11 is adjusted by averaging the two super-resolved pixel values produced by the LSR and the MSR at that position. All pixels other than the boundary pixels remain unchanged.

C. HR Picture Coding and Overhead Analysis

Assume that the proposed algorithm is applied to digital imaging applications such as DSCs and digital camcorders. The LR video sequence that the user is shooting is assumed to be compressed and recorded using the H.264 video coding standard [21], and the HR KFs are encoded by H.264 intra coding.
The compressed HR bit-stream as well as the LR bit-stream must be decompressed prior to the SR process, and an arbitrary LR frame selected from the decompressed LR video sequence can then be super-resolved using the decoded HR KFs, as shown in Fig. 8. Note that the LR KFs are included in the LR sequence. Since the HR KFs are only a kind of supplementary information, their compressed stream results in an overhead in terms of storage space. Note that normal interpolation algorithms do not require such an overhead. So, we need to derive coding conditions that yield successfully interpolated frames while maintaining a reasonable HR overhead. If the LR-to-HR scaling ratio is 1:2 in both directions, and the compression ratios of the LR sequence and the HR KFs are 1/a and 1/b, respectively, then the HR overhead is defined as a/15b for N of 60 frames, e.g., 1 s or 2 s. In general, the compression ratio of H.264 inter-frame coding is over three times higher than that of H.264 intra-frame coding. In this case, the HR overhead can amount to 20% at minimum. Such an overhead can be burdensome in terms of memory cost for storage media. In order to further decrease the HR overhead, we utilize the high spatial correlation between an LR KF and its corresponding HR KF, as in scalable video coding. That is, we employ the up-scaled LR KF as the predictor of the HR KF, as in inter-frame coding. Cubic convolution can be used for up-scaling the LR frames. Finally, the differential KF between the original HR KF and the predictor is encoded by H.264. We applied this differential HR encoding scheme to the first frames of several well-known 1280×720 video sequences: City, Mobcal, and Shields. The H.264 coding parameters of this experiment are described in Table II. Fig. 12 shows the comparison results with H.264 intra coding. We can reduce the HR overhead to below 10% while providing a reasonable PSNR of about 35 dB.
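The overhead figure quoted above can be checked directly: with a 1:2 scaling ratio an HR KF carries 4× the raw samples of an LR frame, and one HR KF is kept per N LR frames, which yields the a/15b expression for N = 60. A small sketch (the function name is ours):

```python
def hr_overhead(a, b, n=60):
    """Compressed HR-KF size relative to the compressed LR sequence.

    1/a, 1/b: compression ratios of the LR sequence and the HR KFs;
    one HR KF (4x the raw LR frame size) is stored per n LR frames.
    """
    return (4.0 / b) / (n / a)   # equals a / (15 * b) when n = 60

# With inter coding ~3x more efficient than intra coding (a = 3b),
# the overhead is a / (15b) = 0.2, i.e., the 20% minimum quoted above.
print(hr_overhead(a=300.0, b=100.0))  # 0.2
```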
Also, the differential HR coding provides about 2 dB better PSNR than H.264 intra coding on average.

V. Experimental Results

A. Simulation Environment

For performance evaluation, we used popular MPEG video sequences: four CIF sequences (Containership, Mobile, News, and Hall Monitor), three 1280×720 sequences (City, Mobcal, and Shields), and one 1920×1080 sequence (Traffic). In addition, we used three 1920×1080 sequences that were acquired together with their LR versions by the authors using a Canon 5D Mark II camera: Customer, Flower, and Resolution Chart, as shown in Fig. 13. This paper sets the scaling ratio to 2.0. The original HR MPEG sequences were down-sampled by a factor of 1/2 after anti-aliasing filtering with the coefficients [22, 0, −52, 0, 159, 256, 159, 0, −52, 0, 22]/512. We compared the performance of the proposed algorithm with those of six existing algorithms: bilinear interpolation (BLI), bi-cubic interpolation, NEDI [1], Farsiu's algorithm [6], Fan's algorithm [12], and Brandi's algorithm [14]. In implementing NEDI, the window size and the threshold to declare an edge pixel were both set to 8. In Farsiu's algorithm, a typical multiple-frame super-resolution method, the number of reference frames was set to 30, and the fast full search of RC-FSA was used for motion estimation for a fair comparison. Also, the

280 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 3, MARCH 2011

block sizes of the CIF and higher-resolution sequences were set to 4×4 and 16×16, respectively. In Fan's algorithm, the number of nearest neighbors and the size of the LR and HR patches were set to 5 and 7×7, respectively. About 400 000 primitive examples were extracted from the same training images as in [12]. For the ME of Brandi's algorithm, a single matching block size of 16×16, which provides the best performance, was evaluated, and the reference frames were interpolated using bi-cubic interpolation. We also applied the fast full-search part of RC-FSA to Brandi's algorithm in order to shorten the running time and to achieve a fair comparison with the proposed algorithm. The other conditions were the same as for the proposed algorithm.

TABLE I
PSNR Performance of the Hybrid SR According to Adaptive T_m and Constant T_m

              Adaptive |                       Constant T_m
                 T_m   |   1     2     3     4     5     6     7     8     9    10    11    12
Container        33.2  | 31.1  31.6  31.8  31.9  32.1  32.3  32.4  32.6  32.8  32.9  32.9  33.0
Hall Monitor     38.0  | 34.5  35.8  37.2  37.7  38.0  38.0  37.9  37.9  37.9  37.9  37.9  37.9
Mobile           25.5  | 24.3  24.3  24.5  24.6  24.8  25.0  25.2  25.4  25.5  25.6  25.6  25.6
News             36.1  | 35.1  35.5  35.9  36.1  36.1  36.2  36.1  36.1  36.0  35.9  35.8  35.6
City             32.9  | 33.1  33.3  33.4  33.4  33.3  33.2  33.0  32.9  32.9  32.8  32.6  32.3
Mobcal           31.0  | 28.8  29.0  29.3  29.7  30.2  30.5  30.8  30.9  31.0  31.0  31.0  31.0
Shields          32.7  | 32.3  32.5  32.7  32.8  32.8  32.8  32.7  32.6  32.5  32.5  32.4  32.4
Traffic          36.5  | 35.7  36.2  36.5  36.7  36.7  36.6  36.5  36.4  36.2  36.1  35.9  35.7

Fig. 13. Some of the test video sequences. (a) Resolution Chart. (b) Flower. (c) Customer.

TABLE II
Encoding Parameters

Parameter                        Value
Search range                     ±64 pels
Number of reference frames       1
CABAC                            ON
RD optimization                  OFF
B-frame coding                   OFF
Motion estimation mode           Fast full search
Rate control                     Enabled
Frame rate                       30 Hz

Fig. 12. Encoding performance according to HR overhead. (a) City. (b) Mobcal. (c) Shields. Values in brackets indicate the coding bit-rate.
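The LR test material above is produced by separable anti-aliasing filtering followed by 2:1 decimation. A sketch of that pre-processing step, using the 11-tap filter quoted in the text; the negative side lobes are our assumption (the usual half-band low-pass form), since signs are easily lost in print, and zero padding is assumed at the frame borders.

```python
import numpy as np

# 11-tap anti-alias filter from the text; the -52 side lobes are assumed.
TAPS = np.array([22, 0, -52, 0, 159, 256, 159, 0, -52, 0, 22], dtype=float)

def downsample_by_2(img):
    """Separable low-pass filtering followed by 2:1 decimation."""
    taps = TAPS / TAPS.sum()  # normalize to unity DC gain
    # filter rows, then columns ('same' keeps the frame size)
    rows = np.apply_along_axis(lambda r: np.convolve(r, taps, "same"), 1, img)
    both = np.apply_along_axis(lambda c: np.convolve(c, taps, "same"), 0, rows)
    return both[::2, ::2]     # keep every second sample in each direction

hr = np.full((32, 32), 100.0)   # constant test frame
lr = downsample_by_2(hr)        # 16 x 16; ~100 everywhere away from borders
```

A constant frame passes through unchanged (away from the zero-padded borders), which is a quick sanity check that the taps are normalized correctly.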
Subjective quality as well as PSNR was compared.

B. Performance Evaluation

In order to evaluate the pure interpolation performance, we applied the various interpolation algorithms to the original LR and HR frames. In this experiment, the KF interval N was fixed to 30 frames, so the first 31 frames of each test sequence were extracted. With the 0th and 30th frames as KFs, the 15th LR frame was interpolated as the target frame. The motion search range of RC-FSA for the MSR was set to [−64, +64] pixels both vertically and horizontally, i.e., w = 64. L×L and M×M were set to 4×4 and 16×16, respectively. First of all, we examine the performance of the LSR according to the number of clusters K. From the simulation results of Table III, we can see that the LSR provides the

best performance for K of 256 or 512. In the following experiments, K was set to 512.

TABLE III
PSNR Performance of the LSR According to the Number of Clusters (K)

                 64    128    256    512   1024   2048
Container       28.7   29.7   30.6   30.6   28.9   27.3
Hall Monitor    30.8   32.2   33.5   34.1   30.7   28.6
Mobile          23.9   24.2   24.6   24.3   22.8   22.2
News            32.7   33.7   34.7   33.9   31.0   29.1
City            32.8   32.9   33.0   33.1   33.3   33.5
Mobcal          28.3   28.5   28.6   28.8   28.9   29.1

Table IV tabulates the PSNRs of the several algorithms for the interpolated target LR frames. In order to clearly show the synergy obtained by combining the LSR and the MSR, we also list the PSNR results of the MSR and the LSR in the table. We can see that the proposed hybrid SR achieves significantly better performance than the MSR and the LSR, as well as the previous algorithms, because the MSR and the LSR produce a synergy effect in the proposed hybrid SR. For the Hall Monitor sequence, the PSNR of the hybrid SR is 9.8 dB higher than that of BLI, and 8.3 dB higher than that of Fan's algorithm [12], which is one of the best learning-based SR methods. Note that the proposed algorithm also outperforms the conventional multiple-frame SR algorithm [6] in terms of PSNR. Some target frames, such as those of News and Traffic, are rarely well motion-compensated. For those sequences, the hybrid SR overcomes the drawback of the MSR and improves the PSNR performance by up to about 4 dB. Figs. 14 and 15 show that the proposed hybrid SR provides much better visual quality for the News and Shields sequences than the other algorithms. Note that the up-scaled image achieved by the proposed algorithm has more details and sharpness in comparison with those of the other algorithms.

Fig. 14. Interpolation results for News. (a) Original. (b) BLI. (c) Bi-cubic. (d) NEDI. (e) Farsiu's. (f) Fan's. (g) Brandi's. (h) MSR. (i) Hybrid SR.
In addition, we can observe that the proposed algorithm does not suffer from poor motion compensation [see the ballerina area in Fig. 14(g)–(i) and the Shields area in Fig. 15(g)–(i)]. Moreover, Fig. 16 shows that the proposed algorithm maintains significantly better details than the other algorithms for real sequences such as Customer, which was shot by the authors. Thus, the proposed algorithm outperforms the other algorithms in terms of subjective quality as well as PSNR. On the other hand, Table V shows the PSNR performance of the MSR according to N, with the target frame located at the center, when the target frame is the 30th frame of each video

Fig. 15. Interpolation results for a lower center region of Shields. (a) Original. (b) BLI. (c) Bi-cubic. (d) NEDI. (e) Farsiu's. (f) Fan's. (g) Brandi's. (h) MSR. (i) Hybrid SR.

TABLE IV
PSNR Comparison for Uncompressed Images [dB]

             Container  Hall Monitor  Mobile  News  City  Mobcal  Shields  Traffic  Customer  Flower  Resolution
BLI             26.9        28.2       22.0   28.0  29.6   26.8    29.9     31.8     29.3      32.7     29.8
Bi-Cubic        27.9        29.1       22.9   29.4  30.7   27.7    31.1     33.3     29.6      33.0     30.6
[1]             26.7        28.2       21.9   28.2  28.4   26.7    29.7     31.7     29.4      32.7     29.9
[6]             29.3        30.8       24.2   31.5  31.3   29.3    31.3     31.5     30.1      33.6     30.9
[12]            27.8        29.7       22.6   30.4  29.6   26.9    30.0     31.4     29.9      33.1     30.3
[14]            28.8        30.4       23.4   30.4  31.6   27.8    30.9     33.1     31.0      34.5     31.4
MSR             31.9        37.4       24.5   31.9  31.9   30.9    31.4     32.3     35.6      38.8     38.5
LSR             30.6        34.1       24.3   33.9  33.1   28.8    32.3     35.3     32.9      35.3     32.4
Hybrid SR       33.2        38.0       25.5   36.1  32.9   31.0    32.7     36.5     36.4      39.6     40.5
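The PSNR values compared throughout this section follow the standard definition; a minimal helper, assuming 8-bit frames (the function name is ours):

```python
import numpy as np

def psnr(ref, img, peak=255.0):
    """PSNR in dB between an original HR frame and an interpolated one."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(img, float)) ** 2)
    return float("inf") if mse == 0.0 else 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8))
print(psnr(a, a))                       # identical frames -> inf
print(psnr(a, np.full((8, 8), 255.0)))  # maximal error    -> 0.0 dB
```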

sequence. For Hall Monitor, which has no global motion, and Mobile, which has little global motion, the KF interval N seldom affects the performance of the MSR. On the contrary, N dominates the interpolation performance for Shields, which has large global motion. So, we can conclude that the amount of global motion is a key factor in determining the parameter N. Based on this simulation result, we examined the dependence of the hybrid SR on N, as in Fig. 17. The results coincide with our expectation that a high dependence of the MSR on N may affect the performance of the hybrid SR. For example, the hybrid SR provides noticeably better performance as N decreases for the Container and Shields sequences. However, for the Mobile sequence, the proposed algorithm maintains almost constant performance. In addition, Fig. 18 shows that the proposed algorithm provides relatively stable PSNR curves for several full video sequences. Note that the proposed hybrid SR algorithm maintains more stable visual quality than the MSR and Brandi's algorithm.

Fig. 16. Interpolation results for a central region of Customer. (a) Original. (b) Bi-cubic. (c) NEDI. (d) Fan's. (e) MSR. (f) Hybrid SR.

TABLE V
PSNR Performance of the MSR According to N

                N=10   N=20   N=30   N=40   N=50   N=60
Container       35.1   32.1   31.3   31.8   31.2   31.5
Hall Monitor    33.0   32.6   32.2   32.0   31.2   31.4
Mobile          24.6   24.8   24.6   24.4   24.1   23.7
News            33.6   32.7   32.0   32.0   31.8   31.3
Shields         32.5   31.4   31.2   24.7   23.7   20.6

TABLE VI
Running Times for Various Algorithms [sec]

         Bi-Cubic   [6]    [12]   [14]   Proposed (RC-FSA)   Proposed (EPZS)
720p        3.6      84     724   27.9         33.5               12.7
1080p       9.5     156     998   56.3         65.2               23.2

C. Computational Complexity

The above experiments were executed on a system with an Intel Core 2 Duo CPU at 2.5 GHz and 3 GB of RAM. Table VI compares the running times of the various algorithms for super-resolving a single LR frame.
The running time of the proposed algorithm with RC-FSA is similar to that of Brandi's algorithm because motion estimation occupies most of the complexity. Note that our algorithm achieves a significantly shorter running time than Fan's algorithm while still providing better visual quality. Also, the proposed hybrid SR algorithm is much faster than Farsiu's algorithm [6]. However, the proposed algorithm is still about seven times slower than the bi-cubic algorithm for 1080p sequences. In order to reduce the computational burden caused by motion estimation, we replaced the fast full-search part of RC-FSA with a well-known fast ME algorithm, EPZS [22], which is informatively adopted in the H.264 JM software. This improves the processing speed of the proposed algorithm by about three times, as shown in Table VI, with negligible PSNR degradation. The size of on-chip memory such as SRAM is a critical factor in hardware implementation. Here, we consider how

much on-chip memory overhead the HR KFs cause in the proposed algorithm. The on-chip memory of the LSR is mostly required for the trained dictionary, and may amount to about 400 kb under the experimental conditions of Section V-B. On the contrary, the on-chip memory of the MSR mainly stores the search area for motion estimation, e.g., 4.5 kb for w = 64. In more detail, the trained dictionary consists of the LR patches of 8 kb for indexing and the weight sets of 392 kb. Note that only the indexing part of the dictionary needs to be implemented as on-chip memory for real-time processing. In practice, we can store the remaining part of the dictionary, i.e., the weight sets, in an external memory such as DDR SDRAM, and access the external memory to read only the indexed weight sets. As a result, the total size of on-chip memory for the MSR as well as the LSR can be reduced to about 13 kb, which is an acceptable on-chip memory size.

Fig. 17. Performance of the proposed algorithm according to N for several test images. (a) Container. (b) Mobile. (c) Shields.

VI. Conclusion

We proposed a video SR algorithm that interpolates an arbitrary frame in an LR video sequence from sparsely existing HR key frames. First, hierarchical motion estimation is performed between the input LR frame and the LR key frames on a patch basis. An LR patch is super-resolved using bi-directional overlapped block motion compensation if it has a small motion-compensated error. Otherwise, the LR patch is spatially super-resolved using the dictionary that is learned from the LR and HR KFs on the fly. Finally, we apply a specific de-blocking process to reduce the blocking artifact that can occur at the boundary between motion-compensated patches and spatially super-resolved patches.
The simulation results show that the proposed algorithm provides significantly better subjective visual quality as well as higher PSNR than previous interpolation algorithms.

Acknowledgment

This work was supported by the DMC Research and Development Center, Samsung Electronics Company, Ltd., Suwon, Korea.

Fig. 18. PSNR curves for full video sequences. (a) News. (b) Mobile.

References

[1] X. Li and M. T. Orchard, "New edge-directed interpolation," IEEE Trans. Image Process., vol. 10, no. 10, pp. 1521–1527, Oct. 2001.
[2] Q. Wang and R. K. Ward, "A new orientation-adaptive interpolation method," IEEE Trans. Image Process., vol. 16, no. 4, pp. 889–900, Apr. 2007.
[3] S. M. Kwak, J. H. Moon, and J. K. Han, "Modified cubic convolution scaler for edge-directed nonuniform data," Opt. Eng., vol. 46, no. 10, p. 107001, 2007.
[4] M. Zhao, M. Bosma, and G. de Haan, "Making the best of legacy video on modern displays," J. Soc. Inform. Display, vol. 15, no. 1, pp. 49–60, 2007.
[5] S. C. Park, M. K. Park, and M. G. Kang, "Super-resolution image reconstruction: A technical overview," IEEE Signal Process. Mag., vol. 20, no. 3, pp. 21–36, May 2003.
[6] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, "Fast and robust multiframe super-resolution," IEEE Trans. Image Process., vol. 13, no. 10, pp. 1327–1344, Oct. 2004.
[7] M. Protter, M. Elad, H. Takeda, and P. Milanfar, "Generalizing the non-local-means to super-resolution reconstruction," IEEE Trans. Image Process., vol. 18, no. 1, pp. 36–51, Jan. 2009.

[8] M. V. Zibetti and J. Mayer, "A robust and computationally efficient simultaneous super-resolution scheme for image sequences," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 10, pp. 1288–1300, Oct. 2007.
[9] W. Freeman, T. Jones, and E. Pasztor, "Example-based super-resolution," IEEE Comput. Graph. Applicat., vol. 22, no. 2, pp. 55–65, Mar.–Apr. 2002.
[10] J. Sun, N. Zheng, H. Tao, and H. Shum, "Image hallucination with primal sketch priors," in Proc. CVPR, vol. 2, 2003, pp. 729–736.
[11] J. Sun, Z. Xu, and H. Shum, "Image super-resolution using gradient profile prior," in Proc. CVPR, 2008, pp. 1–8.
[12] W. Fan and D. Y. Yeung, "Image hallucination using neighbor embedding over visual primitive manifolds," in Proc. CVPR, 2007, pp. 1–7.
[13] X. Li, K. M. Lam, G. Qiu, L. Shen, and S. Wang, "Example-based image super-resolution with class-specific predictors," J. Vis. Commun. Image Representation, vol. 20, no. 5, pp. 312–322, 2009.
[14] F. Brandi, R. Queiroz, and D. Mukherjee, "Super-resolution of video using key frames and motion estimation," in Proc. IEEE ICIP, Oct. 2008, pp. 321–324.
[15] B. C. Song, K. W. Chun, and J. B. Ra, "A rate-constrained fast full-search algorithm based on block sum pyramid," IEEE Trans. Image Process., vol. 14, no. 3, pp. 308–311, Mar. 2005.
[16] K. Illgner and F. Muller, "Multiresolution video compression: Motion estimation and vector field coding," in Proc. GLOBECOM, vol. 3, Nov. 1996, pp. 1478–1482.
[17] M. T. Orchard and G. J. Sullivan, "Overlapped block motion compensation: An estimation-theoretic approach," IEEE Trans. Image Process., vol. 3, no. 5, pp. 693–699, Sep. 1994.
[18] T. Y. Kuo and C.-C. J. Kuo, "Fast overlapped block motion compensation with checkerboard block partitioning," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 6, pp. 705–712, Oct. 1998.
[19] Information Technology-Coding of Audio-Visual Objects: Visual, Committee Draft, ISO/IEC 14496-2, Oct. 1997.
[20] S. Haykin, Adaptive Filter Theory, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall, 1996, ch. 9.
[21] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264/ISO/IEC 14496-10 AVC, 2003.
[22] A. M. Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation," in Proc. SPIE Visual Commun. Image Process., Jan. 2002, pp. 1069–1079.

Byung Cheol Song (SM'08) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 1994, 1996, and 2001, respectively. From 2001 to 2008, he was a Senior Engineer with the Digital Media Research and Development Center, Samsung Electronics Company, Ltd., Suwon, Korea. In March 2008, he joined the School of Electronic Engineering, Inha University, Incheon, Korea, where he is currently an Assistant Professor. His current research interests include the general areas of video coding, video processing, super-resolution, stereo vision, multimedia system design, image coding, content-based multimedia retrieval, and data mining.

Shin-Cheol Jeong received the B.S. degree in electronic engineering from Inha University, Incheon, Korea, in 2009. He is currently pursuing the M.S. degree at the School of Electronic Engineering, Inha University. His current research interests include image processing, super-resolution, and video coding.

Yanglim Choi received the B.S. degree in mathematics from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 1991, and the Ph.D. degree in mathematics from the California Institute of Technology, Pasadena, in 1998. He joined the DMC Research and Development Center, Samsung Electronics Company, Ltd., Suwon, Korea, in 1998, where he is currently a Principal Engineer.
His current research interests include the general areas of image/video processing, super-resolution, stereo vision, content-based image search, and noise removal filtering.