arxiv: v1 [cs.cv] 29 Mar 2016


FAST: Free Adaptive Super-Resolution via Transfer for Compressed Videos

Zhengdong Zhang, Vivienne Sze
Massachusetts Institute of Technology
{zhangzd,

Abstract. High-resolution displays are increasingly popular, requiring most existing video content to be adapted to higher resolution. State-of-the-art super-resolution algorithms mainly address the visual quality of the output rather than real-time throughput. This paper introduces FAST, a framework to accelerate any image-based super-resolution algorithm running on compressed videos. FAST leverages the similarity between adjacent frames in a video. Given the output of a super-resolution algorithm on one frame, the technique adaptively transfers it to the adjacent frames and skips running the super-resolution algorithm on them. The transfer has negligible computation cost because the required information, including motion vectors, block sizes, and prediction residuals, is embedded in the compressed video for free. In this work, we show that FAST accelerates state-of-the-art super-resolution algorithms by up to an order of magnitude with an acceptable quality loss of up to 0.2 dB. Thus, we believe that the FAST framework is an important step towards enabling real-time super-resolution algorithms that upsample streamed videos for large displays (footnote 1).

Keywords: Video coding, Super-resolution

Footnote 1: 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

1 Introduction

Today, screens with extremely high resolution are increasingly popular. There are televisions with 8K resolution and smartphone screens with 4K resolution. Unfortunately, the lack of ultra-high-resolution content has limited the adoption of these large screens. Given the abundance of existing lower-resolution videos, one solution is to use a super-resolution (SR) algorithm to upsample these videos to higher resolution. Existing SR algorithms focus more on the visual quality of the upsampled video frames than on real-time processing capability. Multi-frame based algorithms typically run offline. Running single-image based SR algorithms independently on each frame is much faster. However, even the fastest algorithm,

SRCNN [1], takes 0.4 seconds to run on a small image [1]. When scaled up to a video of moderate size, it takes seconds to process a single frame. Therefore, the computation cost becomes a bottleneck to running these algorithms in real time.

Fig. 1. (a) Pipeline of FAST. SR with FAST: from the compressed video (1), the video decoder first decodes the low-resolution video frames (2) and the syntax elements (3). The SR algorithm is applied to the first frame to obtain a high-resolution output (4). Using the syntax elements (3), that is, the block size w x h, the target position (x, y) and the source position (x', y'), FAST transfers the high-resolution details from the first frame (4) to the second frame (5). There is no need to apply SR on the second frame. (b) FAST result (panels: Ground-truth, Bicubic, SRCNN, SRCNN with FAST): running SRCNN with FAST preserves the rich high-frequency details that SRCNN generates, compared to the blurry output of bicubic interpolation.

This paper proposes a technique, called Free Adaptive Super-resolution via Transfer (FAST), to accelerate existing single-image based SR algorithms running frame-by-frame on videos. FAST leverages the inter-frame similarity between adjacent frames in a video sequence with a typical frame rate of 30 fps. Given the SR result of one frame, FAST transfers it to the following frames without running the SR algorithm on them. The transfer is significantly faster than SR, so the running time of the SR algorithm is amortized across all frames. Thus, the more frames FAST transfers to, the larger the reduction in running time. In addition, the information that enables FAST to transfer is already embedded in the compressed video.
When compressing a target block of pixels in a frame, a modern video encoder (MPEG-2, H.264/AVC, H.265/HEVC) leverages the inter-frame similarity of a video and searches neighboring frames, near the target block, for a block that predicts it with small error. This technique is called motion compensation. The relative location of the predicting block, called the motion vector, and the prediction error, or residual, are part of the syntax elements embedded in the compressed video. The encoder exploits the same temporal redundancy as FAST and pursues the same objective. Therefore, FAST can directly use the embedded syntax elements, which are freely available, to guide the transfer, skipping the block search. Fig. 1(a) illustrates the main steps of FAST. For further acceleration, FAST uses the non-overlapping block division embedded in the compressed video; this is faster than the overlapped blocks that are typically

used in SR algorithms, since it avoids redundant computations. With a simple deblocking filter [2], FAST removes most of the artificial edges at the block boundaries while maintaining the quality of the result. FAST also adapts to the video content for improved quality and speed. Specifically, FAST operates on blocks of varying size that are adaptively selected by the video encoder based on the frame content. In addition, for each block, FAST adaptively enables or disables the transfer depending on the prediction residual.

In Sect. 7, we demonstrate how FAST accelerates state-of-the-art SR algorithms on the video sequences of the JCT-VC (footnote 2) common test conditions that were used in the development of the latest video coding standard, H.265/HEVC [3]. The test sequences include 20 diverse scenes of static, dynamic, natural and man-made video content. The acceleration can be up to 15× and the quality loss in PSNR is around 0.2 dB, which demonstrates the effectiveness of FAST.

Footnote 2: The standards body that developed H.265/HEVC.

The main contribution of this paper is the technique, FAST, that amortizes the running time of single-image based SR methods. As far as we know, this paper is the first to show how to adaptively use the freely available information embedded in compressed videos for video super-resolution, reducing computation while maintaining the quality of the output. Moreover, FAST can be combined with any single-image based SR algorithm. By accelerating SR algorithms by an order of magnitude, FAST can potentially lead to a system that upsamples streamed videos in real time, enabling a better viewing experience on larger displays.

Paper organization. This paper discusses related work in Sect. 2. It then reviews motion compensation in video coding and formulates FAST in Sect. 3. In Sect. 4, the paper shows how to adaptively perform the transfer depending on the prediction residual to suppress artifacts. In Sect. 5, it shows how FAST uses non-overlapping blocks and a deblocking filter to reduce computation without loss of quality. Sect. 6 discusses how the syntax elements of real encoded videos allow FAST to skip further computations, and how varying block sizes further increase the savings. Sect. 7 presents the evaluation results, and the paper concludes with a discussion and a summary of contributions in the final section.

2 Previous Work

Single-frame based super-resolution. In recent decades, super-resolution on a single image has typically been formulated as a learning problem [4,5]. Many pairs of low-resolution and high-resolution image patches are collected from training images, forming a training dictionary. Various learning techniques exploit it to hallucinate the lost high-frequency information. Among them are sparse representation [6,7], kernel ridge regression (KRR) [8], anchored neighbor regression (ANR) [9], and in-place example regression [10]. Note that FAST is similar to in-place example regression [10], which performs local block prediction inside the same frame, whereas FAST uses predictions across frames. Recently, methods such as SRCNN [1] and [11] apply deep neural networks to SR. They achieve state-of-the-art results with the fastest speed, albeit still not fast enough

to run in real time on videos. Complementary to learning approaches, there are algorithms that exploit the self-similarity of patches within each image [12,13]. However, they are much slower than SRCNN.

Fig. 2. Reviewing the basics of video coding. (a) Frames in a group-of-pictures (GOP) structure are similar. (b) Predict a block with motion compensation, and embed the residual and motion vector in the compressed video. [Diagram labels: source patch at (x', y') in Frame 1, target patch at (x, y) in Frame 2, motion vector (MV), residual; the encoder writes block info, motion vector and residual into the compressed video.]

Multiple-frame based super-resolution. Multiple-frame based super-resolution algorithms are largely based on the registration of neighboring frames [14]. Many of these algorithms are iterative, including the Bayesian approach [15] and the $\ell_1$-regularized total variation approach [16]. There are also non-iterative methods that avoid registration, using non-local means [17] and 3D steering kernel regression [18]. Deep learning has also been explored, with [19] applying bidirectional recurrent convolutional networks and [20] using deep draft-ensemble learning. These algorithms are slower than single-image super-resolution algorithms and generally run offline.

Video coding. Predicting a block in the current frame from a motion-compensated block in adjacent frames is a common technique in video coding, dating back to MPEG-2, developed in the 1990s [21]. Later, H.264/AVC [22] was developed to handle high-resolution videos through a more flexible encoding scheme that allowed more prediction modes and varying block sizes. Bundled with more features, the latest standard, H.265/HEVC [23], achieves the same visual quality as H.264/AVC at half the bitrate [24]. Despite decades of progress, motion compensation between neighboring frames remains the core component of modern video codecs.
3 Transferring with Motion Vectors and Residuals

This section first reviews the basics of video coding techniques to help understand how FAST transfers SR results. It then describes how FAST works, and emphasizes the importance of the quarter-pixel accurate motion vector from the compressed video for FAST to obtain high SR quality.

3.1 Reviewing the basics of video coding

Group-of-pictures (GOP) structure. A modern video encoder typically divides a video sequence into many sections, each called a group-of-pictures (GOP) structure. Each GOP structure contains between 6 and 40 frames. In most cases, frames in the same GOP structure are visually similar, with no scene transitions. See Fig. 2(a) for an illustration.

Motion compensation. To encode a block in a frame, a modern video encoder searches previously encoded frames inside the same GOP structure for a similar block that predicts it with minimal error. The encoder then signals only the small difference between the two blocks, which is easier to compress than the original block, whose content may be complex. The offset between the location of the predicting block and the location of the target block is called a motion vector, and it can be fractional. The difference between the two blocks is called a residual. Both the motion vector and the residual are part of the syntax elements that are embedded in the compressed video. Fig. 2(b) visualizes this process. At the decoding stage, the decoder retrieves these syntax elements, performs motion compensation to generate the predicting block, and combines it with the residual to reconstruct the target block.

Embedded syntax elements. The compressed video contains the syntax elements for decoders to use. These include the GOP structure, the block division in each frame (discussed later in Sect. 5), the motion vector for each block, and the prediction residuals. These syntax elements are available for free when the video is decoded.

3.2 Formulating super-resolution transferring

Here we describe how FAST upsamples a compressed video by a factor of $\alpha \in \mathbb{Z}^+$. For simplicity, we only consider two adjacent frames, denoted by $I_1^l$ and $I_2^l$, where the blocks in $I_2^l$ are all predicted by motion-compensated blocks in $I_1^l$. Here, the subscript indicates the frame index, and the superscript $l$ stands for lower resolution. The goal is to compute the higher-resolution images $I_1^h$ and $I_2^h$, where the superscript $h$ stands for higher resolution. We apply an SR algorithm $f_{sr}(\cdot)$ on $I_1^l$ to get $I_1^h = f_{sr}(I_1^l)$. Instead of applying $f_{sr}(\cdot)$ on the second frame $I_2^l$ to get $I_2^h$, FAST transfers $I_1^h$ to get $I_2^h$ using the syntax elements, as shown in Fig. 3.

FAST performs the transfer to $I_2^h$ block by block. We now examine how FAST upsamples a low-resolution block $P_2^l(x) \in \mathbb{R}^{H \times W}$ on $I_2^l$ to the higher-resolution block $P_2^h(\alpha x) \in \mathbb{R}^{\alpha H \times \alpha W}$, which is $\alpha$ times larger. Here $x \in \mathbb{R}^2$ and $\alpha x \in \mathbb{R}^2$ denote the locations of these blocks on the corresponding images. The decoder tells FAST that $P_2^l(x)$ is predicted by a block from $I_1^l$ with motion vector $dx_{1,2}$ and prediction residual $R_{1,2}^l(x) \in \mathbb{R}^{H \times W}$. From the definition of motion compensation, we have

$$P_2^l(x) = P_1^l(x + dx_{1,2}) + R_{1,2}^l(x) \qquad (1)$$
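As a concrete illustration of Eq. (1), the following is a minimal one-dimensional sketch of motion compensation, not the codec's actual implementation. It assumes an exhaustive integer-pixel search that minimizes the sum of absolute differences; the function names `encode_block` and `decode_block`, the search range, and the toy frames are all hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length blocks."""
    return sum(abs(p - q) for p, q in zip(a, b))


def encode_block(ref, cur, x, width, search=4):
    """Toy encoder: find the integer offset dx that best predicts
    cur[x:x+width] from ref, and return (dx, residual)."""
    target = cur[x:x + width]
    best_dx, best_cost = 0, None
    for dx in range(-search, search + 1):
        start = x + dx
        if start < 0 or start + width > len(ref):
            continue  # candidate block falls outside the reference frame
        cost = sad(ref[start:start + width], target)
        if best_cost is None or cost < best_cost:
            best_dx, best_cost = dx, cost
    pred = ref[x + best_dx:x + best_dx + width]
    residual = [t - p for t, p in zip(target, pred)]
    return best_dx, residual


def decode_block(ref, x, width, dx, residual):
    """Toy decoder, Eq. (1): P2(x) = P1(x + dx) + R(x)."""
    pred = ref[x + dx:x + dx + width]
    return [p + r for p, r in zip(pred, residual)]


# Frame 2 is frame 1 shifted right by two pixels, with one pixel changed.
frame1 = [10, 10, 10, 50, 90, 90, 90, 90, 10, 10, 10, 10]
frame2 = [10, 10, 10, 10, 10, 51, 90, 90, 90, 90, 10, 10]

dx, residual = encode_block(frame1, frame2, x=4, width=4)
reconstructed = decode_block(frame1, 4, 4, dx, residual)
print(dx, residual, reconstructed)
```

In this toy example the residual is almost entirely zero, which is the property FAST relies on: the decoder already holds `dx` and `residual`, so no block search is needed at transfer time.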

Fig. 3. Pipeline of the FAST algorithm. [Diagram: the SR algorithm $f_{sr}$ maps low-resolution frame 1, $I_1^l$, to high-resolution frame 1, $I_1^h$. For a block of low-resolution frame 2, $I_2^l$, the block $P_1^h(\alpha(x + dx_{1,2}))$ is fetched from $I_1^h$ using the motion vector $dx_{1,2}$, the residual $R_{1,2}^l(x)$ is upsampled with bicubic interpolation to $b(R_{1,2}^l(x))$, and their sum is pasted into high-resolution frame 2, $I_2^h$, as $P_2^h(\alpha x)$.]

Here, the subscripts of $dx_{1,2}$ and $R_{1,2}^l$ in Eq. (1) highlight the prediction direction from frame 1 to frame 2. Furthermore, let $P_1^h(\alpha x) = f_{sr}(P_1^l(x))$. Using a first-order Taylor expansion, FAST approximates $P_2^h(\alpha x)$ by

$$P_2^h(\alpha x) = f_{sr}\big(P_1^l(x + dx_{1,2}) + R_{1,2}^l(x)\big) \approx f_{sr}\big(P_1^l(x + dx_{1,2})\big) + \big\langle \nabla f_{sr}, R_{1,2}^l(x) \big\rangle \qquad (2)$$

In [10], it is claimed that $\langle \nabla f_{sr}, R_{1,2}^l(x) \rangle$ is close to an upsampling and sharpening operator. In practice, we observe that bicubic interpolation of $R_{1,2}^l(x)$ approximates $\langle \nabla f_{sr}, R_{1,2}^l(x) \rangle$ well enough, and saves computation. Therefore, denoting by $b(\cdot)$ the $\alpha$-times bicubic upsampler, FAST finally outputs

$$P_2^h(\alpha x) \approx P_1^h(\alpha(x + dx_{1,2})) + b\big(R_{1,2}^l(x)\big) \qquad (3)$$

In summary, FAST transfers the high-resolution output of $f_{sr}$ on frame 1 to frame 2. To obtain each block in $I_2^h$, FAST requires: (1) $P_1^h(\alpha(x + dx_{1,2}))$, obtained via motion compensation on $I_1^h$ with motion vector $\alpha\, dx_{1,2}$ at $\alpha x$, and (2) the bicubically upsampled residual. Since $dx_{1,2}$ may be fractional, computing $P_1^h(\alpha(x + dx_{1,2}))$ may require interpolation. Observe that FAST skips applying $f_{sr}$ to the second frame entirely, and all of its operations are similar in complexity to bicubic upsampling. This gives significant savings in computation compared to modern SR algorithms such as SRCNN. Sect. 7 shows how effective this approximation is in maintaining the quality of the SR result.

3.3 Importance of quarter-pixel accurate motion vector

The motion vector $dx_{1,2}$ in Eq. (1) can be pixel accurate or sub-pixel accurate, which has a large impact on the quality of the FAST output.
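To make the role of motion vector accuracy concrete, here is a minimal one-dimensional sketch of the transfer in Eq. (3). It rests on several assumptions that are not from the paper: linear interpolation stands in for both the bicubic upsampler $b(\cdot)$ and the fractional-sample interpolation, sample positions are aligned naively (high-resolution index = $\alpha$ times low-resolution index), and the names `sample`, `upsample` and `fast_transfer` are hypothetical.

```python
import math


def sample(signal, pos):
    """Linearly interpolate `signal` at a fractional position,
    clamping positions that fall outside the signal."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    i = int(math.floor(pos))
    j = min(i + 1, len(signal) - 1)
    t = pos - i
    return (1.0 - t) * signal[i] + t * signal[j]


def upsample(block, alpha):
    """Stand-in for b(.): upsample by `alpha` with linear interpolation
    (the paper uses bicubic interpolation)."""
    return [sample(block, k / alpha) for k in range(alpha * len(block))]


def fast_transfer(high1, x, width, dx, residual, alpha):
    """Eq. (3): P2_h(alpha*x) ~= P1_h(alpha*(x + dx)) + b(R)."""
    start = alpha * (x + dx)  # fractional whenever alpha*dx is not an integer
    moved = [sample(high1, start + k) for k in range(alpha * width)]
    up_res = upsample(residual, alpha)
    return [m + r for m, r in zip(moved, up_res)]


high1 = [float(v) for v in range(16)]  # toy high-resolution frame 1 (a ramp)
zero_residual = [0.0, 0.0]

# Quarter-pixel motion vector versus the same vector rounded to a whole pixel.
quarter = fast_transfer(high1, x=2, width=2, dx=-0.25, residual=zero_residual, alpha=2)
rounded = fast_transfer(high1, x=2, width=2, dx=0.0, residual=zero_residual, alpha=2)
print(quarter)
print(rounded)
```

With $\alpha = 2$, rounding the quarter-pixel vector to a whole pixel misplaces the transferred block by half a high-resolution pixel in this example, which hints at why sub-pixel accuracy matters once the displacement is scaled by $\alpha$.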
To demonstrate this, we build a simplified video compression algorithm that provides FAST with motion vectors of controlled accuracy on two consecutive low-resolution frames, which are synthetically downsampled from the Middlebury dataset [25]. We consider integer-pixel, half-pixel and quarter-pixel accuracy. For each setting, we run SRCNN to

upsample the first frame and run FAST for the second frame. Then we compute the PSNR between the ground-truth high-resolution second frame and the FAST output. Fig. 4 presents the results. We can see that the PSNR increases with higher motion vector accuracy, validating the need for quarter-pixel motion vectors. Fortunately, motion vectors with quarter-pixel accuracy are available in modern video codecs, which is sufficient to enable high-quality transfer. Again, this information is provided for free.

Fig. 4. (a) Bicubic. (b) FAST with pixel accurate motion vector. (c) FAST with quarter-pixel accurate motion vector. (d) PSNR vs. motion vector accuracy (bars: bicubic, pixel, half pixel, quarter pixel). (a)(b)(c) FAST with quarter-pixel accurate motion vector gives sharper SR results than bicubic interpolation, and avoids the artifacts near the edges in the output of FAST with pixel accurate motion vector. (d) The PSNR of the FAST output increases as the motion vector becomes more accurate. [Bar labels legible in this copy: 36.12, 37.27, 37.31; a fourth label is truncated.]

4 Adaptive Transfer Based on Residual Magnitude

In practice, motion prediction may not be perfect. For some blocks, the prediction residual can be high in energy and contain sharp edges, which causes ringing artifacts in the transferred output. This section discusses how FAST avoids these artifacts by adaptively applying the transfer based on whether the residual magnitude exceeds a given threshold.

4.1 Ringing artifacts due to large prediction residual

Fig. 5(a) highlights the ringing artifacts that we observe in the output if we use FAST on all the blocks of a frame. Note that the artifact shown occurs in smooth regions, rather than at sharp edges. Intuitively, this occurs when the encoder predicts a smooth region with a source patch that contains a sharp edge. Consequently, the prediction residual will also have a sharp edge at the same position.
When SR is run on the source patch, the sharp edge is preserved, but the edge in the residual is blurred by the bicubic interpolation. Combining a sharp edge with a blurry edge of the opposite sign leaves part of the edge uncancelled, which appears as ringing. Fig. 5(b) illustrates this phenomenon.

4.2 Thresholding the residual to avoid ringing artifacts

To avoid the ringing artifacts, FAST computes the mean absolute magnitude of the residual block, and thresholds it to decide whether to transfer the block or not. If the magnitude of the residual is small, the transfer is performed. Otherwise, the low-resolution patch is directly upsampled using bicubic interpolation. To learn the threshold $\eta$, we collect many pairs of source and target images, and synthetically downsample them to lower resolution. Then we divide the low-resolution target image into blocks. For each block $P_i^l$, we search for the best

prediction patch on the source image for motion compensation, from which the residual with mean absolute magnitude $e_i$ is obtained. Given $P_i^{h,t}$, the transferred output, and $P_i^{h,b}$, the block upsampled with bicubic interpolation, we compute their PSNRs, $y_i^t$ and $y_i^b$, using the high-resolution ground truth. Here the superscript $t$ stands for transfer, $b$ stands for bicubic and $h$ stands for high resolution. Suppose FAST only transfers when $e_i \le \eta$. Then $\eta$ is chosen to maximize the PSNR across all blocks with the following optimization:

$$\max_{\eta} \; \sum_{i,\, e_i \le \eta} y_i^t + \sum_{i,\, e_i > \eta} y_i^b \qquad (4)$$

In all the experiments presented in the paper, we set $\eta = 10$. Fig. 5(a) shows how this simple threshold avoids the ringing artifact in the FAST result.

Fig. 5. (a) Adaptive transfer to avoid artifacts (panels: Ground truth, Adaptive, Non adaptive): adaptive transfer avoids the ringing artifact in flat areas of the output of non-adaptive transfer. (b) Cause of the artifact: sharp edges in the residual cause ringing artifacts. [Diagram: plots of intensity I(x) against position x. The encoder turns the low-resolution input into a low-resolution prediction and a low-resolution residual; the SR algorithm yields a sharp high-resolution prediction, bicubic interpolation yields a blurred high-resolution residual, and reconstructing from the two produces a ringing artifact in the high-resolution output.]

4.3 Thresholding is more than an acceleration tool

Thresholding the image content to adaptively apply either bicubic interpolation or a more sophisticated SR method is a common technique. Typically, it is considered an acceleration tool. However, we have shown that in the case of FAST, it is also a way to obtain better quality. This is not an isolated case. We observe the same phenomenon when running the in-place example regression SR [10]. Similar to FAST, it decomposes a patch into an in-place prediction and its residual.
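The thresholding rule of Sect. 4.2 and the selection of $\eta$ in Eq. (4) can be sketched as follows. This is a minimal illustration under stated assumptions: the search simply tries every observed residual magnitude as a candidate threshold, the per-block PSNR values are made-up numbers rather than measurements from the paper, and the names `learn_threshold` and `choose_mode` are hypothetical.

```python
def learn_threshold(errors, psnr_transfer, psnr_bicubic):
    """Pick eta maximising Eq. (4):
    sum over e_i <= eta of y_t[i]  +  sum over e_i > eta of y_b[i]."""
    # Candidate thresholds: each observed e_i, plus one value below all
    # of them, which corresponds to "never transfer".
    candidates = [min(errors) - 1.0] + sorted(set(errors))
    best_eta, best_score = None, None
    for eta in candidates:
        score = sum(t if e <= eta else b
                    for e, t, b in zip(errors, psnr_transfer, psnr_bicubic))
        if best_score is None or score > best_score:
            best_eta, best_score = eta, score
    return best_eta, best_score


def choose_mode(residual, eta):
    """Per-block decision: transfer when the mean absolute residual is small."""
    mean_abs = sum(abs(r) for r in residual) / len(residual)
    return "transfer" if mean_abs <= eta else "bicubic"


# Made-up training data: mean absolute residual e_i per block, and the PSNR
# (dB) of the transferred and the bicubic output for that block.
errors = [1.0, 3.0, 8.0, 20.0, 35.0]
psnr_transfer = [36.0, 35.0, 33.5, 28.0, 25.0]
psnr_bicubic = [33.0, 32.5, 32.0, 31.0, 30.0]

eta, score = learn_threshold(errors, psnr_transfer, psnr_bicubic)
print(eta, score)
print(choose_mode([2, -4, 6, 0], eta), choose_mode([30, -40, 20, 10], eta))
```

Because the decision only needs the mean absolute value of a residual that the decoder has already produced, the check adds very little work per block.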
In our experiment on the sequence RubberWhale in the Middlebury dataset [25], in-place example regression SR without thresholding actually performs worse than bicubic interpolation, with a lower PSNR (29.56 dB versus 31.52 dB). However, with proper thresholding, the method gives a significantly better PSNR (32.68 dB).

5 Non-overlapping Blocks with Deblocking Filters

Most SR algorithms divide an image into densely overlapped blocks, and average the outputs on these overlapped blocks to avoid discontinuities at the block boundaries. This is very expensive since one pixel in a frame is processed

multiple times in the different blocks that cover it. In contrast, FAST uses a non-overlapping block division so that each pixel is covered by exactly one block. Hence each pixel is only processed once, significantly reducing the computation. Fig. 6(a) shows examples of non-overlapping block divisions from videos compressed with the latest standard, H.265/HEVC.

Fig. 6. (a) Block structure: an image is adaptively divided into non-overlapping blocks, with larger blocks corresponding to simple and well-predicted content. (b) Examples of deblocking (panels: No deblocking, With deblocking, SRCNN): running SRCNN with FAST before and after deblocking, compared to SRCNN results on the 2nd frame. Best viewed in color. (c) PSNR on Cactus: PSNR improvement via deblocking on the test sequence Cactus. [Bar labels legible in this copy: 32.55, 32.87, 32.53; a fourth label is truncated.]

However, such a non-overlapping block division introduces artificial edges at the block boundaries. FAST addresses this by applying, on the block boundaries, an adaptive deblocking filter similar to the ones used in video codecs [2]. Fig. 6(b)(c) shows how such filters remove the artificial edges and improve the quality of the FAST output in terms of PSNR. The main objective of the deblocking algorithm is to remove the artificial edges at the block boundaries caused by non-overlapping block based coding, while keeping the true edges. An important heuristic is that an edge tends to be artificial if there is little variation on both sides of the block boundary. Additional information, such as the difference between motion vectors, can also help to decide whether to deblock or not. The smoothing strength of the deblocking filter is determined from the statistics of the pixels near the block boundary. The deblocking filter in H.265/HEVC requires fewer operations than bicubic interpolation on a block, so its computation cost is negligible.
The effectiveness of non-overlapping blocks with a deblocking filter is also recognized in video coding, where this technique achieves visual quality [26] comparable to that of techniques using overlapped blocks [27]. For more technical details, please refer to [2].
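The heuristic described above (an edge at a block boundary is likely artificial when both sides are nearly flat) can be sketched in one dimension as follows. This is a deliberately simplified illustration and not the H.265/HEVC deblocking filter of [2]: the tolerances `flat_tol` and `max_step`, the quarter-step smoothing, and the function name `deblock_1d` are all assumptions made for the example.

```python
def deblock_1d(signal, boundary, flat_tol=2.0, max_step=20.0):
    """Smooth the step between signal[boundary-1] and signal[boundary]
    when it looks like a blocking artifact rather than a true edge."""
    out = list(signal)
    p1, p0 = signal[boundary - 2], signal[boundary - 1]  # left of the boundary
    q0, q1 = signal[boundary], signal[boundary + 1]      # right of the boundary
    step = q0 - p0
    # Little variation on both sides suggests the step is artificial.
    flat_sides = abs(p0 - p1) <= flat_tol and abs(q1 - q0) <= flat_tol
    # A very large step is treated as a true edge and left untouched
    # (an assumption of this sketch).
    if flat_sides and 0 < abs(step) <= max_step:
        out[boundary - 1] = p0 + step / 4.0
        out[boundary] = q0 - step / 4.0
    return out


blocky = [100, 100, 100, 100, 108, 108, 108, 108]     # small step between flat blocks
true_edge = [100, 100, 100, 100, 200, 200, 200, 200]  # strong edge, kept as is
textured = [100, 90, 100, 90, 98, 108, 98, 108]       # busy content, kept as is

print(deblock_1d(blocky, 4))
print(deblock_1d(true_edge, 4))
print(deblock_1d(textured, 4))
```

Only the two pixels adjacent to the boundary are touched, so the cost per boundary is a handful of additions and comparisons.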

Fig. 7. Statistics of block structure for all encoded video sequences. (a) Pixel distribution in blocks with different sizes: 8x8 16%, 16x16 25%, 32x32 29%, 64x64 30%. (b) Fraction of pixels belonging to blocks with zero motion vector that get directly copied: direct copy 43%, motion interpolation 57%; of the copied pixels, 64x64 47%, 32x32 28%, 16x16 16%, 8x8 9%. (c) Fraction of pixels belonging to blocks with zero residual, with bicubic interpolation of the residual skipped: skip 82%, bicubic 18%; of the skipped pixels, 64x64 34%, 32x32 29%, 16x16 23%, 8x8 14%.

6 Further Reduction of Computation Cost in FAST

According to Eq. (3), the output of FAST on a pixel in the high-resolution frame is the sum of two components: the transferred pixel from $P_1^h(\alpha(x + dx_{1,2}))$ and the bicubically upsampled residual, $b(R_{1,2}^l(x))$. Both require interpolation.

6.1 Blocks with zero motion vector or zero residual

To further reduce computation, FAST leverages the following two conditions to avoid unnecessary interpolations for certain blocks:

1. Blocks with zero motion vector. For pixels in such blocks, FAST can directly copy the same pixels from the previous frame without interpolation to get $P_1^h(\alpha(x + dx_{1,2}))$.
2. Blocks with zero residual. For pixels in such blocks, the bicubic interpolation of the residual can be skipped.

Note that these conditions can be checked at the block level. In fact, the compressed video contains a bit for each block called the skip flag. It indicates whether the residual is all zero or not, and is freely available to FAST. Once a block satisfies either of the conditions, FAST applies the corresponding shortcut to all the pixels in the block. In practice, we observe that these two conditions are met by many blocks in the compressed test videos of Sect. 7. Fig. 7(b) shows that 43% of all the pixels belong to blocks with zero motion vector, so they can be copied without fractional interpolation. In addition, 82% of all the pixels belong to blocks with zero residual, for which the bicubic interpolation is skipped. Overall, these two conditions reduce the computation cost of FAST by more than half.

6.2 Blocks with varying size speed up FAST

In H.264/AVC and H.265/HEVC, frames are divided into blocks of varying size, with larger blocks assigned to flat or well-predictable regions and smaller blocks assigned to highly textured areas, as shown in Fig. 6. FAST gains additional speed by copying and skipping all the pixels of a large block with a single check of the residual skip flag and the zero motion vector. As shown in Fig. 7(a), larger blocks (32×32 and 64×64) account for more pixels than smaller blocks (8×8 and 16×16). This is further supported by Fig. 7(b)(c), which demonstrate that most of the pixels that get copied or skipped belong to the largest blocks.
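The two block-level shortcuts of Sect. 6.1 can be sketched as a small dispatch routine. This is an illustrative sketch only: the tuple layout `(size, motion_vector, skip_flag)`, the operation names, and the toy block list are hypothetical, and "work" is counted simply as the number of pixels that still need each kind of interpolation.

```python
def plan_block(mv, skip_flag):
    """List the operations one block needs under the two shortcuts."""
    # Zero motion vector: copy co-located pixels, no fractional interpolation.
    ops = ["copy" if mv == (0, 0) else "interpolate_motion"]
    # Skip flag set: the residual is all zero, so b(R) need not be computed.
    if not skip_flag:
        ops.append("upsample_residual")
    return ops


def interpolation_work(blocks):
    """Count pixel-level interpolations left after the block-level checks."""
    work = 0
    for size, mv, skip_flag in blocks:
        pixels = size * size
        ops = plan_block(mv, skip_flag)
        work += pixels * sum(op != "copy" for op in ops)
    return work


# Hypothetical blocks: (block size, motion vector, skip flag).
blocks = [
    (64, (0, 0), True),    # large static block: copy only
    (32, (1, 0), True),    # moving block with zero residual
    (8, (3, -2), False),   # small textured block: both interpolations
    (16, (0, 0), False),   # static block with a non-zero residual
]

# Without shortcuts, every pixel needs both interpolations.
baseline = 2 * sum(size * size for size, _, _ in blocks)
work = interpolation_work(blocks)
print(work, baseline)
```

One check per block decides the path for all of its pixels, so the larger the block, the more pixels are handled by a single test.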

Fig. 8. (a) PSNR of algorithms (SRCNN, FAST, Bicubic) on the test sequence FourPeople, per frame ID and as the mean over the first 4 and 16 frames. (b) Running time of algorithms (SRCNN, FAST) and the resulting speed-up on the test sequence FourPeople, per frame ID and as the mean over the first 4 and 16 frames. (c) Trade-off between speed-up and quality loss (PSNR loss vs. speed-up). By transferring to more frames in a chained manner, FAST achieves more speed-up at the cost of a larger loss of quality. Nevertheless, even with more than 10× speed-up, the loss of PSNR is only 0.2 dB. [The numeric entries of panels (a) and (b) did not survive extraction intact.]

7 Experimental Results

7.1 Evaluation dataset and setup

We evaluate FAST on the video sequences of the JCT-VC common test conditions [3]. As these test videos were used in the development of the latest video coding standard, H.265/HEVC, they cover a wide variety of video content. We use the original uncompressed videos as the high-resolution ground truth. Then we synthetically downsample them to lower resolution, and encode them with the H.265/HEVC encoder [28]. The SR algorithms take the decompressed frames as well as the syntax elements as input. We use KRR [8], ANR [9] and SRCNN [1] (footnote 3) as the SR algorithms in the experiment. For each algorithm, we conduct two experiments. In the first experiment, we run the SR algorithm directly on all of the low-resolution frames. In the second experiment, we use the SR algorithm to upsample the first frame, then use FAST to transfer the result to all the remaining frames. We call this running SR with FAST. For quantitative evaluation, we compute the PSNR between each output frame and the ground-truth high-resolution frame. Because the ground-truth frames are not compressed, there is an initial degradation of PSNR due to the quantization in the lossy compressed low-resolution videos.
This explains why the reported PSNR gain of SR algorithms over bicubic interpolation in our experiments is around 1 dB lower than the gains reported in the original papers. We also measure the running time of all the algorithms, using non-optimized MATLAB implementations, to show how FAST accelerates SR algorithms. All the experiments are conducted on a 3.3 GHz Xeon CPU. Note that we choose to encode each video with a chained GOP structure, i.e., a frame is always predicted from the previous frame. This is the most challenging case for FAST, since the difference between the FAST output and the SR output accumulates as more transfers are performed.

Footnote 3: We contacted the author for the C/C++ implementation, but we could only get the MATLAB code. So the reported running time of SRCNN is much slower than the numbers in [1].


[Fig. 9, three rows of image panels, each showing Frame 2 and Frame 16 of a sequence: Ground-truth, SRCNN, SRCNN with FAST, Bicubic; Ground-truth, ANR, ANR with FAST, Bicubic; Ground-truth, KRR, KRR with FAST, Bicubic.]

Fig. 9. Running different SR algorithms with FAST on different frames of different sequences. Note how FAST maintains the appearance of the SR output. Best viewed in color.

7.2 Evaluation results

SR quality. Fig. 8(a) tabulates the PSNR for each individual frame, as well as the average over the first 4 and 16 frames, when running FAST with SRCNN on the sequence FourPeople. For the first 4 frames, FAST gives PSNR comparable to SRCNN. At the 16th frame, FAST gives slightly worse PSNR (< 0.2 dB) than SRCNN, which is acceptable and still significantly better than bicubic interpolation. Again, this is the most challenging test case, and such accumulated degradation in the later frames will not happen in other setups where all intermediate frames are predicted from the two fixed key frames at both ends of the GOP structure.

Visual illustration. Fig. 9 shows how FAST maintains the quality of the output of different SR algorithms, with better performance than bicubic interpolation.

The figure presents the FAST output for both the 2nd and the 16th frame. Note how similar the 2nd frame is to the 16th frame in these sequences, which intuitively explains the effectiveness of FAST.

Acceleration. Fig. 8(b) tabulates the running time of SRCNN and of SRCNN with FAST on each frame of the sequence FourPeople. The average running time over the first 4 and 16 frames is also included. The cost of the transfer is negligible compared with SRCNN. Therefore, the average running time per frame is approximately the running time of applying SR to the first frame divided by the number of processed frames.

Trade-off between PSNR and acceleration ratio. With the chained GOP structure, there is a trade-off between the acceleration ratio and the average PSNR. Fig. 8(c) shows this trade-off for running SRCNN with FAST on the sequence BlowingBubbles. Again, this trade-off exists only in chained GOP structures.

Additional results. Table 1 shows the PSNR as well as the acceleration ratio of running all the SR algorithms with FAST across all sequences when FAST only transfers to the 4th frame. Table 2 shows the same metrics when FAST transfers to the 16th frame.

[Table 1. Columns: sequence name; size; SR, FAST and speed-up for each of KRR, SRCNN and ANR; Bicubic. Rows, with sizes as extracted: BQMall (416x), BQSquare (208x), BQTerrace (960x), BasketballDrill (416x), BasketballDrillText (416x), BasketballDrive (960x), BasketballPass (208x), BlowingBubbles (208x), Cactus (208x), ChinaSpeed (960x), FourPeople (512x), Johnny (640x), Kimono (960x), KristenAndSara (640x), ParkScene (960x), PartyScene (416x), PeopleOnStreet (1280x), RaceHorses (416x), SlideEditing (640x), Traffic (1280x), Average.]

Table 1. With 3 transfers from the 1st frame to the 4th frame, FAST achieves around 4x speed-up uniformly across all sequences for all SR algorithms, with no quality loss.
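The approximation stated in the Acceleration paragraph (average time per frame is roughly the SR time on the first frame divided by the number of processed frames) can be illustrated with a small cost model. This is a sketch under simplifying assumptions that are not from the paper: the SR running time is taken to be the same for every frame, each transfer has a fixed cost, and the function names and the numbers in the usage note are hypothetical.

```python
def average_time_per_frame(sr_time, transfer_time, num_frames):
    """Average per-frame time when SR runs on the first frame only and the
    result is transferred to the remaining num_frames - 1 frames."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    total = sr_time + (num_frames - 1) * transfer_time
    return total / num_frames


def speedup(sr_time, transfer_time, num_frames):
    """Speed-up over running SR on every frame (assumed constant SR time)."""
    return sr_time / average_time_per_frame(sr_time, transfer_time, num_frames)
```

With a hypothetical transfer cost of 1/500 of the SR time, `speedup(1.0, 1 / 500, 4)` is about 3.98 and `speedup(1.0, 1 / 500, 16)` is about 15.5. As the transfer cost goes to zero, the speed-up approaches the number of processed frames, which is the approximation used in the text.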
It is clear from the tables that FAST maintains the PSNR and significantly accelerates every SR algorithm on every sequence, which supports the claim that FAST can be combined with any SR algorithm. Note that FAST is currently implemented in MATLAB, and Sect. 5 shows that, in theory, the computation cost of the algorithm should be as low as that of bicubic interpolation. Therefore, we believe that FAST has the potential to enable SR algorithms to upsample videos for large screens in real time.

[Table 2. Columns: sequence name; size; SR, FAST and speed-up for each of KRR, SRCNN and ANR; Bicubic. Rows, with sizes as extracted: BQMall (416x), BQSquare (208x), BQTerrace (960x), BasketballDrill (416x), BasketballDrillText (416x), BasketballDrive (960x), BasketballPass (208x), BlowingBubbles (208x), Cactus (208x), ChinaSpeed (960x), FourPeople (512x), Johnny (640x), Kimono (960x), KristenAndSara (640x), ParkScene (960x), PartyScene (416x), PeopleOnStreet (1280x), RaceHorses (416x), SlideEditing (640x), Traffic (1280x), Average.]

Table 2. With 15 transfers from the 1st frame to the 16th frame, FAST achieves more than 10x speed-up on average over all sequences for all SR algorithms, with around 0.2 dB loss. Nevertheless, the PSNR of the FAST output is still significantly higher than that of the bicubic output.

8 Discussion and Conclusions

We have shown how FAST helps accelerate various SR algorithms with acceptable quality loss. The key idea behind FAST is to exploit the temporal redundancy of a video, which modern video codecs embed as syntax elements in the compressed video. As far as we know, FAST is the first technique to adaptively use this free information for fast video super-resolution. FAST also demonstrates how non-overlapping block division combined with a deblocking filter saves computation and avoids artifacts near block boundaries. The large number of blocks with either zero motion vectors or zero residuals enables FAST to skip further computation, and variable block sizes enable FAST to exploit this more efficiently. Given its effectiveness and low complexity, FAST can be combined with any SR algorithm. Therefore, we believe that the FAST framework is an important step towards running high-quality SR algorithms in real time at 30 fps on lower-resolution content for ultra-high-resolution displays.

References

1. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Computer Vision (ECCV). Springer (2014)
2. Norkin, A., Bjøntegaard, G., Fuldseth, A., Narroschke, M., Ikeda, M., Andersson, K., Zhou, M., Van der Auwera, G.: HEVC deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology 22(12) (2012)
3. Bossen, F.: Common test conditions and software reference configurations. Document JCTVC-H1100, JCT-VC, CA, Feb
4. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning low-level vision. International Journal of Computer Vision 40(1) (2000)
5. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. IEEE Computer Graphics and Applications 22(2) (2002)
6. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19(11) (2010)
7. Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: Curves and Surfaces. Springer (2010)
8. Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(6) (2010)
9. Timofte, R., Smet, V., Gool, L.: Anchored neighborhood regression for fast example-based super-resolution. In: Proceedings of the IEEE International Conference on Computer Vision. (2013)
10. Yang, J., Lin, Z., Cohen, S.: Fast image super-resolution based on in-place example regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2013)
11. Bruna, J., Sprechmann, P., LeCun, Y.: Super-resolution with deep convolutional sufficient statistics. arXiv preprint (2015)
12. Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: 2009 IEEE 12th International Conference on Computer Vision, IEEE (2009)
13. Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2015)
14. Baker, S., Kanade, T.: Super-resolution optical flow. Carnegie Mellon University, The Robotics Institute (1999)
15. Liu, C., Sun, D.: A Bayesian approach to adaptive video super resolution. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2011)
16. Mitzel, D., Pock, T., Schoenemann, T., Cremers, D.: Video super resolution using duality based TV-L1 optical flow. In: Pattern Recognition. Springer (2009)
17. Protter, M., Elad, M., Takeda, H., Milanfar, P.: Generalizing the nonlocal-means to super-resolution reconstruction. IEEE Transactions on Image Processing 18(1) (2009)
18. Takeda, H., Milanfar, P., Protter, M., Elad, M.: Super-resolution without explicit subpixel motion estimation. IEEE Transactions on Image Processing 18(9) (2009)
19. Huang, Y., Wang, W., Wang, L.: Bidirectional recurrent convolutional networks for multi-frame super-resolution. In: Advances in Neural Information Processing Systems. (2015)
20. Liao, R., Tao, X., Li, R., Ma, Z., Jia, J.: Video super-resolution via deep draft-ensemble learning. In: Proceedings of the IEEE International Conference on Computer Vision. (2015)
21. Le Gall, D.: MPEG: A video compression standard for multimedia applications. Communications of the ACM 34(4) (1991)
22. Recommendation ITU-T H.264: Advanced Video Coding for Generic Audiovisual Services. Technical report, ITU-T (2003)
23. High efficiency video coding. ITU-T Recommendation H.265 and ISO/IEC (April 2013)
24. Ohm, J.R., Sullivan, G.J., Schwarz, H., Tan, T.K., Wiegand, T.: Comparison of the coding efficiency of video coding standards including high efficiency video coding (HEVC). IEEE Transactions on Circuits and Systems for Video Technology 22(12) (2012)
25. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. International Journal of Computer Vision 92(1) (2011)
26. Budagavi, M., Fuldseth, A., Bjøntegaard, G.: HEVC Transform and Quantization. In Sze, V., Budagavi, M., Sullivan, G.J., eds.: High Efficiency Video Coding (HEVC): Algorithms and Architectures. Springer (2014)
27. Malvar, H.S.: Lapped transforms for efficient transform/subband coding. IEEE Transactions on Acoustics, Speech and Signal Processing 38(6) (1990)
28. Sullivan, G.J., Ohm, J., Tan, T.K., Wiegand, T.: Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 22(12) (Dec 2012)
