680 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 3, SEPTEMBER 2008


A New Objective Quality Metric for Frame Interpolation Used in Video Compression. Kai-Chieh Yang, Student Member, IEEE, Ai-Mei Huang, Student Member, IEEE, Truong Q. Nguyen, Fellow, IEEE, Clark C. Guest, Member, IEEE, and Pankaj K. Das, Member, IEEE.

Abstract: A new metric for assessing the spatial quality of temporally interpolated frames is proposed. It is designed to overcome disadvantages found in several widely used metrics for the frame-interpolation application. The proposed metric quantifies the severities of several pronounced interpolation artifacts and combines them with human visual factors to form a global (i.e., frame-wide) quality measurement. The artifact amounts are also used to determine the additional quality impairment caused by locally aggregated distortion. Finally, these two measurements are combined into a single quality score. Subjective experiments show that the proposed metric significantly outperforms other commonly used metrics.

Index Terms: Motion compensated frame interpolation, objective quality assessment, video signal processing, visual attention, visual system.

I. INTRODUCTION

In many video applications such as broadcasting and high-definition TV, motion compensated frame interpolation (MCFI) is often adopted at the decoder to improve temporal video quality by increasing the frame rate. By doing this, motion jerkiness and jitter induced by low frame rates can be effectively removed. In MCFI, the missing frames are interpolated using the received motion vector field (MVF) between two temporally adjacent reconstructed frames. Based on the assumption of a smooth motion trajectory, each pixel p of the missing frame f_t can be represented as

f_t(p) = (1/2) [ f_{t-1}(p - v/2) + f_{t+1}(p + v/2) ],   (1)

where v is the MVF received in the bitstream and used to reconstruct the frame.
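As a rough illustration, the bidirectional averaging of Equation (1) can be sketched in Python. The block size, the border clamping, and the MV-array layout below are assumptions for the sketch, not the paper's exact implementation.

```python
import numpy as np

def direct_mcfi(frame_prev, frame_next, mvf, block=16):
    """Sketch of direct bidirectional MCFI: each block of the missing
    frame averages a backward fetch from frame_prev and a forward fetch
    from frame_next along half the received motion vector."""
    h, w = frame_prev.shape
    out = np.zeros_like(frame_prev, dtype=np.float64)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            vy, vx = mvf[by // block, bx // block]  # received MV of this block
            # halve the MV so both predictions meet at the temporal midpoint
            hy, hx = int(round(vy / 2)), int(round(vx / 2))
            ys = slice(by, min(by + block, h))
            xs = slice(bx, min(bx + block, w))
            prev_blk = shift_fetch(frame_prev, ys, xs, -hy, -hx)
            next_blk = shift_fetch(frame_next, ys, xs, hy, hx)
            out[ys, xs] = 0.5 * (prev_blk + next_blk)
    return out

def shift_fetch(frame, ys, xs, dy, dx):
    """Fetch a block displaced by (dy, dx), clamping at frame borders."""
    h, w = frame.shape
    y = np.clip(np.arange(ys.start, ys.stop) + dy, 0, h - 1)
    x = np.clip(np.arange(xs.start, xs.stop) + dx, 0, w - 1)
    return frame[np.ix_(y, x)]
```

With a zero MVF this reduces to plain frame averaging, which is a quick sanity check for the sketch.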
Instead of using forward and backward predictions on the motion trajectory, this interpolation scheme takes bidirectional predictions using the received MVF divided by 2 to avoid the missing-part and overlap problems during frame interpolation. This method is also called direct MCFI, as it assumes that the received motion vectors (MVs) represent true motion and can be used directly. However, the received MVF is often estimated using a block matching algorithm that maximizes coding efficiency rather than finding true motion. As a result, when bidirectional predictions are averaged for the interpolated frame in Equation (1), visual artifacts such as blocking and ghost artifacts [1] can easily be observed when unreliable MVs are used. To solve this problem, several MV processing techniques at the decoder have been proposed to obtain a better MVF for MCFI. Using the assumption of a smooth MVF, a vector median filter is generally employed to remove MV outliers and obtain a smoother MVF for the interpolated frame. In [2], an adaptively weighted vector median filter exploiting prediction residues is presented. The work in [3] proposed using a finer MVF for frame interpolation to eliminate blocking artifacts; that is, each received MV is resampled into four MVs with smaller block sizes using a smoothness measurement. In order to reduce ghost artifacts caused by mismatched bidirectional predictions, the method in [4] adopts bidirectional MV processing for MCFI.

Manuscript received September 28, 2007; revised April 8; first published August 4, 2008; last published August 20, 2008 (projected). The authors are with the Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA, USA (e-mail: kcyang@ucsd.edu; aihuang@ucsd.edu; nguyent@ece.ucsd.edu; clark@ece.ucsd.edu; das@cwc.ucsd.edu). Digital Object Identifier /TBC.
Instead of using high-complexity motion re-estimation at the decoder, the authors in [4] proposed selecting the best MV for each merged group from the neighboring MVs based on minimizing the difference between forward and backward motion compensations. They further proposed a multi-stage MV processing algorithm in [5], which corrects unreliable MVs by gradually reducing block sizes until all ghost and blocking artifacts are removed effectively. All MCFI techniques assume that the temporal quality of compressed video sequences improved by MCFI can be perfectly restored to before-compression levels. However, the spatial quality of interpolated images can be severely degraded, and improving it has become a major challenge in recent MCFI research. Therefore, an accurate and appropriate quality assessment scheme for interpolated video data is essential for understanding the performance of different MCFI techniques.

II. PREVIOUS WORKS

Subjective evaluation [2], [3] is the most convincing approach because it collects direct responses from end users. However, it is inconvenient, expensive, and time consuming. Objective methods provide a feasible alternative. Most research evaluates interpolated frame quality using fidelity metrics. Reference [4] uses Peak-Signal-to-Noise-Ratio (PSNR), a normalized Mean-Square-Error (MSE) between original and processed images, to measure the interpolation quality. Reference [6] measures the quality using the Structural Similarity (SSIM) metric from [7]. SSIM is a recently developed quality metric whose high performance has been demonstrated in [7], [8]. It uses a

combination of luminance, contrast, and pixel-value correlation comparisons as the quality index. The SSIM was originally designed for assessing image quality, but it has also been extended to video quality assessment by including motion models [9] and several human visual factors [10]. However, because of the characteristics of the quality degradation caused by frame interpolation, the performance of fidelity-only measurements could be improved if the following attributes are taken into account:

Fig. 1. Example of (a) blocking and (b) ghost artifacts introduced by frame interpolation.

Fig. 2. Pixel shift example of interpolated frames, produced by (a) the direct MCFI approach and (b) the multi-stage method from [5], with the same subjective quality but PSNR = 31.42 dB and dB respectively.

1) Characteristics of supra-threshold distortion: Video quality degradation can be separated into sub-, near-, and supra-threshold distortions according to its perceptibility to human vision [11]. The sub- and near-threshold classes refer to the types of distortion that are below or slightly above the just-noticeable-difference (JND) respectively. Supra-threshold distortion generally appears in a structured form and is known as artifacts. Blockiness and ghost artifacts, shown in Fig. 1(a) and (b), are known to be the two major artifacts in interpolated frames. These types of distortion are very irritating to human perception and dominate subjective quality judgment. One of the main challenges in supra-threshold distortion measurement is that its appearance varies with video content, and hence the annoyance perceived by humans differs even for the same error energy [12]. According to [8], PSNR is good at estimating near-threshold quality distortion (i.e., white noise), but is not sufficient to cover supra-threshold distortion.
SSIM performs well on both near- and supra-threshold quality degradation, but because of the appearance of artifacts introduced by frame interpolation, its performance can be further improved by combining it with the outputs of a blockiness metric. By doing this, small quality differences between frames produced by various interpolation technologies can be better recognized.

2) Various sensitivities to different spatial-temporal locations: Visual-attention-driven quality measurement and coding have become an important direction for video quality research [11], [13]-[15]. This approach improves quality prediction accuracy by considering sensitivities to different spatial-temporal regions according to human visual attention. Hence, rather than evenly distributed weights, higher weights should be assigned to high-attention regions when pooling the local spatial quality measurement.

3) Pixel shift: Some MCFI schemes excel at producing good quality for moving regions, but the side effect is that some pixels in the static regions may be slightly changed.

Fig. 3. Examples of strong local distortion produced by (a) the direct MCFI approach with PSNR = 29.63 dB and (b) the multi-stage method from [5] with PSNR = 28.06 dB.

This pixel shift can be considered sub-threshold distortion. It is barely noticed by human eyes and the perceived quality is good. However, PSNR usually yields a low quality score since it still counts the shift as part of the distortion. Visual observation shows that Figs. 2(a) and 2(b) are of the same quality, but PSNR gives a 3 dB higher score to Fig. 2(a).

4) Conspicuous local distortion: Compared with the quality degradation caused by frame interpolation, quality degradation caused by compression is usually evenly distributed through the entire frame, since the compression ratios of all compression units (i.e., macroblocks) are similar. However, the quality degradation introduced by frame interpolation may be highly concentrated in a small region because of unreliable MVs.
Areas with this type of distortion might be small, and conventional distortion quantification is correspondingly low. But the large difference in quality from neighboring regions makes such distortion very conspicuous and dramatically enlarges its annoyance to human perception. Most existing quality evaluation methods consider distortion perceptibility only from a human visual system (HVS) [16]-[23] or visual attention [11], [13], [14] perspective. Very few of them take into account the effect of variable spatial quality contrast. In this case, even if the factors of HVS and visual attention are included, regions with low quality scores and small area but high spatial quality contrast are still missed and their impact is averaged out. Figs. 3(a) and 3(b) have very similar quality except for a noticeable distortion on the edge of the helmet in Fig. 3(a). However, PSNR contradictorily assigns a lower quality value to

Fig. 3(b). This provides strong evidence that the annoyance caused by locally high quality contrast must be considered. To emulate this phenomenon, several quality metrics from [22], [23] use a certain percentage of the worst scores to adjust the average scores. However, this approach is heuristic and the correct percentage is hard to determine. In order to overcome the problems described above, a novel metric, the Perceptual Frame Interpolation Quality Metric (PFIQM), based on SSIM, is proposed to evaluate the spatial quality degradation introduced by frame interpolation. The focus of this work is to assess spatial quality degradation caused by frame interpolation. Since the frame rate of compressed video increases after frame interpolation, the temporal quality (i.e., jerkiness or jitter [1], [24]) of interpolated sequences, which usually correlates with the frame rate, is assumed to be perfect. This paper is organized as follows. Section III describes the PFIQM in detail. Section IV evaluates the performance of PFIQM and other metrics by comparing the objective scores to subjective ratings. The conclusion is given in Section V.

Fig. 4. System diagram of PFIQM.

III. PROPOSED METRIC

Fig. 4 presents the system diagram of PFIQM. Original, reconstructed, and interpolated frames are used as inputs. The metric can be separated into two principal parts, the Global Quality Estimator (GQE) and the Local Distortion Estimator (LDE). The former emulates visual attention effects and estimates the frame-wide distributed distortion, and the latter covers locally aggregated quality distortion while considering spatial quality contrast. A blockiness metric is used to estimate the amount of blocking artifacts, whereas SSIM is used to estimate the severity of ghost artifacts.
In the GQE, both metrics' outputs are adjusted by a Conspicuousness Map based on motion and gaze-centering effects. The adjusted outputs for blockiness and ghost artifacts are normalized by the quality values of neighboring reconstructed frames to form normalized quality scores, which are subsequently integrated into a Global Quality value. For the LDE, the metrics' outputs are adjusted by a Distortion Salience Map (DSM) to produce blockiness and ghost quality scores for local quality distortion measurement. Afterward, these distortion measurements are normalized and combined into a Local Distortion value. Finally, both values are integrated with the guidance of motion to form a final quality score.

A. Blockiness Estimator

A well-known blockiness metric from [25] is adopted to measure the amount of blocking artifacts. Quality metrics can be categorized into Full-, Reduced-, and No-Reference (FR, RR, and NR respectively) based on the accessibility of the original video data [26], [27]. This blockiness metric was originally an NR metric, but it has been modified to be FR in order to ensure high blockiness estimation accuracy. The metric calculates the pixel-value discontinuity at each 8x8 boundary of both the original and processed frames; the difference of the boundary discontinuity between the original and processed frames is then used as the boundary discontinuity measurement. Because blocking artifacts cannot be recognized in very dark or bright lighting conditions [26], the discontinuity is weighted by a luminance masking function. The adjusted pixel discontinuity is normalized by the inter-block pixel difference to emulate the texture masking phenomenon [27].
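A minimal sketch of this FR blockiness idea follows. The Gaussian-shaped luminance masking curve is a hypothetical stand-in (the paper's exact masking function is not reproduced here); the rest follows the description: luminance-masked, texture-normalized difference of cross-boundary jumps.

```python
import numpy as np

L0 = 81  # most visible luminance level on an 8-bit scale (from the paper)

def luminance_mask(mean_lum, width=60.0):
    """Hypothetical masking curve: discontinuities near L0 are fully
    visible; very dark or bright boundaries are attenuated."""
    return np.exp(-((mean_lum - L0) / width) ** 2)

def vertical_blockiness(orig, proc, B=8):
    """FR blockiness sketch: luminance-masked difference of cross-boundary
    pixel jumps between original and processed frames, normalized by the
    average non-boundary inter-pixel difference (texture masking)."""
    cols = np.arange(B, orig.shape[1], B)          # vertical block boundaries
    jump = lambda img: np.abs(img[:, cols].astype(float) - img[:, cols - 1])
    d = np.abs(jump(proc) - jump(orig))            # interpolation-induced jump
    mean_lum = 0.5 * (proc[:, cols].astype(float) + proc[:, cols - 1])
    d = d * luminance_mask(mean_lum)
    # texture masking: average inter-pixel difference away from boundaries
    inner = np.abs(np.diff(proc.astype(float), axis=1))
    keep = np.ones(inner.shape[1], dtype=bool)
    keep[cols - 1] = False                         # drop boundary positions
    texture = inner[:, keep].mean() + 1e-6
    return d / texture
```

The horizontal map would be obtained the same way on transposed rows; identical inputs give a zero map, as expected of an FR measure.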
Let the original and processed images be given, with a pixel index running over the height and width of the frame and a block index running over the 8x8 blocks. The pixel discontinuity on a vertical block boundary is estimated by (2), where a norm of the cross-boundary pixel difference is taken and the output of a luminance masking function is used to adjust

the perceptual importance of each boundary discontinuity. The luminance masking function is defined in (3) and (4), where the reference constant represents the most suitable lighting condition on an 8-bit scale for human visual perception and is set to 81, and where the mean value and standard deviation are computed from the pixels on the same row within two adjacent blocks. The final vertical blockiness map (6) is obtained after normalizing the discontinuity by the average inter-pixel difference of the non-boundary pixels, as in (7) and (8). The horizontal blockiness map can be obtained with the same process as the vertical one. The vertical and horizontal boundary-based blockiness maps are then transformed into a block-basis blockiness map by (9); a higher value implies more blocking artifacts.

B. Similarity Estimator

SSIM is used to estimate the severity of ghost artifacts by measuring the deviation of pixel values and the distributions of the original and processed images. First, both the original and processed images are low-pass filtered by a Gaussian filter; the intention of the low-pass filtering is to remove any distortion imperceptible to human eyes. The structural similarity of each pixel is estimated by (10), where the correlation, means, and standard deviations are computed over a window around that pixel in each image, and the two stabilizing constants are determined experimentally in [7]. The similarity map is further processed onto a block basis by (11); a higher value implies fewer ghost artifacts and better quality.

C. Global Quality Estimator (GQE)

To implement the GQE, the blockiness and structure distortion values are first weighted by a conspicuousness map (CM).
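The block-wise similarity estimator of Section III-B can be sketched as follows. This is a minimal form: the 11x11 Gaussian pre-filter is omitted and SSIM is computed directly per 8x8 block; the constants are the standard ones from the cited SSIM paper [7].

```python
import numpy as np

C1, C2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2  # standard constants from [7]

def block_ssim(orig, proc, B=8):
    """Ghost-artifact severity sketch: SSIM computed per 8x8 block.
    Higher value -> fewer ghost artifacts -> better quality."""
    h, w = orig.shape
    out = np.zeros((h // B, w // B))
    x_all, y_all = orig.astype(float), proc.astype(float)
    for i in range(h // B):
        for j in range(w // B):
            x = x_all[i*B:(i+1)*B, j*B:(j+1)*B].ravel()
            y = y_all[i*B:(i+1)*B, j*B:(j+1)*B].ravel()
            mx, my = x.mean(), y.mean()
            vx, vy = x.var(), y.var()
            cov = ((x - mx) * (y - my)).mean()
            out[i, j] = ((2*mx*my + C1) * (2*cov + C2)) / \
                        ((mx*mx + my*my + C1) * (vx + vy + C2))
    return out
```

Identical images yield a map of ones, matching the interpretation that a higher similarity value means fewer ghost artifacts.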
Next, the weighted metrics' outputs are normalized and pooled together to yield frame-level blockiness and ghost artifact measurements. Finally, these two quality scores are integrated to form a Global Quality (GQ) score.

1) Conspicuousness Map Adjustment: A visual attention model is usually constructed by combining several bottom-up or top-down visual attention features [28]-[31]. Among the many visual features, motion is especially important in interpolated frames for two reasons. First, regions with high motion usually suffer severe quality degradation. Second, humans tend to pay more attention to moving regions. For these reasons, and out of concern for computational complexity, we adopt only motion to determine visual attention regions instead of using many visual attention features. Moreover, because of the biological structure of the human eye, the central part of an image draws most of the viewer's attention [32]. Hence, visual sensitivity to distortion decreases as the spatial location moves away from the central area toward the boundary of the image. This holds across different content types and camera views. Based on these reasons, a conspicuousness map is determined by both a motion map and a gaze-centering map. Assuming that the bitstream information is not accessible, the MVs are obtained by running block-matching motion estimation [33] on the original video data. Let

be the motion vectors of each block in the horizontal and vertical directions. The corresponding motion activity (12) is normalized by the maximum and minimum motion activity values within each frame in (13). Finally, the motion map (14) is obtained after post-processing with one iteration of a 3x3 spatial median filter. For frames with very little motion, the quality scores of all blocks are counted and all motion map values are assigned to 1.

Fig. 5. Response of the gaze-centering map. A higher value represents stronger visual sensitivity, which decreases as the spatial location moves toward the boundary.

A two-dimensional anisotropic Gaussian kernel (15) is implemented to emulate the gaze-centering phenomenon, where the kernel is centered at the spatial center of the image and the widths of the Gaussian distribution in the horizontal and vertical directions are set to 800 and 500 respectively, as suggested by [32]. After normalization, the final central focus map is given by (16). Fig. 5 shows the outputs of (16): central regions are assigned higher values, which correspond to higher visual sensitivity. The final conspicuousness map of each block is given by (17).

Weighted by the conspicuousness map, the quality scores from both the blockiness metric and SSIM of all blocks within a frame are averaged as in (18) and (19).

2) Normalization: The appearance of supra-threshold distortion varies with video content, and hence outputs from the blockiness metric and SSIM may occur in different ranges as video content changes. This causes ambiguity when integrating multiple metrics' outputs, and therefore the outputs from (18) and (19) must be normalized into a fixed range. Another purpose of normalization is to align the semantic interpretation of the metrics' outputs.
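The conspicuousness map of Section III-C.1 can be sketched as below. The per-block motion activity, min-max normalization, one median pass, and the anisotropic Gaussian gaze map follow the description; combining the two maps by an elementwise product, and scaling the 800/500 pixel widths into block units, are assumptions of this sketch.

```python
import numpy as np

def median3(a):
    """3x3 median filter with edge replication, pure NumPy."""
    p = np.pad(a, 1, mode="edge")
    stack = [p[i:i + a.shape[0], j:j + a.shape[1]]
             for i in range(3) for j in range(3)]
    return np.median(np.stack(stack), axis=0)

def conspicuousness_map(mvf, frame_hw, sigma=(800.0, 500.0)):
    """CM sketch: per-block motion activity (Eq. (12)), min-max
    normalized (Eq. (13)) and median-filtered (Eq. (14)), combined with
    an anisotropic Gaussian gaze-centering map (Eqs. (15)-(16))."""
    ma = np.hypot(mvf[..., 0], mvf[..., 1])          # motion activity
    span = ma.max() - ma.min()
    mm = (ma - ma.min()) / span if span > 0 else np.ones_like(ma)
    mm = median3(mm)                                 # one 3x3 median pass
    h, w = frame_hw
    by, bx = ma.shape
    yc, xc = (by - 1) / 2.0, (bx - 1) / 2.0
    yy, xx = np.mgrid[0:by, 0:bx]
    # gaze widths given in pixels; scale to block units via the frame size
    gy, gx = sigma[1] * by / h, sigma[0] * bx / w
    gaze = np.exp(-(((xx - xc) / gx) ** 2 + ((yy - yc) / gy) ** 2) / 2)
    gaze /= gaze.max()
    return mm * gaze
```

For a frame with no motion, the map reduces to the gaze-centering response alone, peaking at the image center.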
Higher values of the blockiness metric suggest lower quality, but the output of SSIM is interpreted the other way around. Hence, after normalization, higher values from both metrics mean better quality. Most frame interpolation techniques construct the interpolated frame by fetching part of the data from neighboring reconstructed frames. This implies that (a) the content of an interpolated frame is similar to that of its neighboring reconstructed frames, and (b) quality degradation due to compression in reconstructed frames may propagate to interpolated frames. Therefore, the metrics' outputs from the nearest backward and forward reconstructed frames are employed as normalization baselines for interpolated frames. Taking either pooled output of an interpolated frame, the metric's output is normalized by (20), where the outputs from (18) or (19) of the closest backward and forward reconstructed frames serve as the baseline, and the normalization constants for global quality are obtained experimentally using several interpolated frames that contain global quality distortion only. Finally, the normalized value is bounded by (21) to emulate the perceptual saturation phenomenon. The normalized values from (18) and (19) are used for global quality assessment, and a higher value means better quality.

3) Metrics Integrator: Because the normalized blockiness and ghost scores have been brought to a similar scale, extra weighting is

not necessary. Hence, the final GQ is produced by averaging the two normalized scores as in (22), where a higher GQ value implies better quality.

Fig. 6. An example of DSM determination, in which (a) a noticeable local distortion appears around the edge of the helmet, and (b) the corresponding DSM indicates a high value in those regions.

D. Local Distortion Estimator (LDE)

Similar to the GQE, both blockiness and structural distortions are weighted according to human visual sensitivity to estimate the quality level of a frame. Unlike the GQE, the LDE determines visual sensitivity from the perspective of local distortion contrast instead of motion.

1) Distortion Salience Map Adjustment: Since the appearance of strong local distortion varies, the blockiness measurement might not be generic enough to cover all local distortion cases. Therefore, outputs from (11) are used to determine the noticeability of aggregated quality distortion by considering local distortion contrast, as in (23). Post-processed by three iterations of median filtering and combined with the gaze-centering map from (16), the final DSM of each block is given by (24). Fig. 6 shows a DSM extraction example: a conspicuous distortion is seen at the edge of the helmet, and the DSM successfully detects it and assigns high values to those regions. Only the quality scores of blocks with strong DSM values are considered in local distortion estimation. The local quality distortion from the blockiness and structural similarity aspects is estimated as in (25) and (26), where the pooling runs over the indices and total number of blocks with strong DSM values.

2) Normalization: Taking either of the outputs from (25) and (26), the local distortion is normalized by (27), where the normalization constants for local distortion are applied and the result is bounded by 1. The number of blocks with strong DSM values can be thought of as the area of strong local distortion. If this area is less than 11 blocks, then the local distortion is considered not noticeable and the score is assigned to 0.
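The salience-gated pooling of Eqs. (25)-(27) can be sketched as follows. The salience threshold `tau` and the normalization constant `c` are placeholders for the paper's experimentally obtained values; the 11-block minimum area and the bound at 1 follow the text.

```python
import numpy as np

def local_distortion(block_scores, dsm, tau=0.5, min_area=11, c=1.0):
    """LD sketch: pool per-block distortion only over blocks whose DSM
    value exceeds a salience threshold. Fewer than 11 salient blocks is
    treated as not noticeable (score 0); otherwise the DSM-weighted mean
    is scaled and bounded by 1. Higher output -> more local distortion,
    i.e. worse quality."""
    salient = dsm > tau
    n = int(salient.sum())            # area of strong local distortion
    if n < min_area:
        return 0.0
    ld = (dsm[salient] * block_scores[salient]).mean()
    return float(min(c * ld, 1.0))
```

The final GL integration described next (Eq. (29)) would then subtract a motion-weighted version of this LD value from GQ when motion activity is low.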
Unlike GQ, higher normalized local scores suggest more local distortion and worse quality.

3) Metrics Integrator: The final local distortion score is given by (28).

E. GL Integrator

Motion is used to determine the dominance between GQ and LD when forming the final quality score. High motion results in more artifacts, and in this case GQ is more dominant than LD. Moreover, high motion introduces temporal masking and spatial detail is filtered out; thus, LD can be discarded. Therefore, the final quality of an interpolated frame is given by (29), where the average motion activity of the frame is compared against a threshold that triggers the influence of LD. In the case that motion activity is lower than the threshold, LD is perceivable but its dominance is inversely proportional to motion, which indicates that LD becomes less important as motion increases. Also, the negative sign of the LD term reflects the fact that local distortion is an additional distortion on top of the global quality, so GQ is decreased by a weighted LD.

IV. SIMULATION

A. Experimental Setup

A subjective test was carried out to collect viewers' perceived quality of frames interpolated by different MCFI schemes. Two video clips, FOREMAN [34] and WALK [35], were selected as test sequences because of their wide range of motion activity and texture. The original sequences are

CIF frame resolution with an original frame rate of 30 fps. They are encoded using H.263, but even-numbered frames are skipped to generate video bitstreams of 15 fps. The rate-control function is disabled by fixing the quantization parameter (QP) at 10. The averaged bit rates of these two test sequences are Kbps and Kbps for FOREMAN and WALK, respectively. Fourteen interpolated frames are chosen as test materials based on the criteria in Table I. Each selected frame has five different images produced by five different MCFI techniques: direct MCFI, the vector median filter [2], the MV smoothing method described in [3], MV selection similar to [4] but with a fixed block size, and the multi-stage MV processing method in [5]. Therefore, a total of seventy interpolated images are included in the test data set. The Double Stimulus Continuous Quality Scale (DSCQS) method as standardized in ITU-R BT [36] is adopted as the subjective test method. Original frames serve as reference data and the interpolated frames are used as test data; one reference and one test datum form a test case.

Fig. 7. Subjective test procedure for a test case.

The procedure for a test case is illustrated in Fig. 7, where one interval shows either the reference or the test image data and another shows a gray image for buffering. The DSCQS method first presents the reference and then the test data to participants. Subsequently, this image pair is repeated, but in random order, and participants vote on a quality score. The total duration of a test case is 20 sec, so the entire test session lasts about half an hour. Half an hour has been shown to be the most appropriate experimental length [36] for a video subjective test, avoiding tiring the assessors and thus producing unreliable data. The assessors are asked to grade their perceived quality on a continuous linear scale that ranges from 1 to 5 with the semantic labels Bad, Poor, Fair, Good, and Excellent.
Viewers are allowed to vote in 0.1 increments. A total of ten viewers participated in the experiment, composed of five expert and five non-expert viewers. The raw scores of each test case are converted to difference scores (between the test and reference images) to obtain a subjective Difference Mean Opinion Score (DMOS) value for each interpolated image, where a higher value represents better quality. The objective metrics involved in the performance comparison are the PFIQM, PSNR, and SSIM. In this experiment, two outputs of PFIQM, Q from (29) and GQ from (22), are used as two different metrics to evaluate the quality prediction accuracy with and without considering local quality distortion. Since objective outputs are content dependent, and also for the sake of convenient data interpretation, the metrics' outputs are normalized by the maximum and minimum values of each sequence, as in (30). According to the Phase II Final Report from the Video Quality Experts Group (VQEG) [37], the relationship between the metrics' outputs and the DMOS may not be linear, as subjective testing can exhibit nonlinear rating compression at the extremes of the test range. In order to remove any nonlinearity caused by the subjective rating process, and to facilitate comparison of the metrics in a common analysis space, the normalized metrics' outputs are mapped by a nonlinear regression function, as in (31), whose parameters are obtained by fitting each metric's normalized outputs against the DMOS values. As a result, each metric has its own parameter set, and the corresponding mapped outputs represent the objective scores that are closest to the subjective ratings. The best performance of each metric can be obtained by this mapping process. After normalization and nonlinear transformation, the outputs of all metrics range from 1 to 5 and a higher value means better quality.
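The per-sequence normalization of (30) and the nonlinear mapping of (31) can be sketched as below. The logistic form is a hypothetical stand-in of the kind VQEG reports commonly use; the paper's exact regression function is not recoverable from the source, and the parameters would be fitted per metric against the DMOS.

```python
import numpy as np

def minmax_normalize(scores):
    """Eq. (30) sketch: map a metric's per-sequence outputs into [0, 1]
    using that sequence's own min and max, removing content-dependent
    scale differences between sequences."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def logistic_map(q, b1, b2, b3):
    """Hypothetical stand-in for the nonlinear regression of Eq. (31):
    a 3-parameter logistic mapping normalized scores toward the
    subjective rating scale."""
    return b1 / (1.0 + np.exp(-b2 * (q - b3)))
```

In practice `b1`, `b2`, `b3` would come from a least-squares fit of each metric's normalized outputs against the DMOS values, e.g. with `scipy.optimize.curve_fit`.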
The mapped objective scores were compared with the DMOS values by computing their correspondence using the following measures: (a) Pearson correlation coefficient: this measure estimates the model prediction accuracy, i.e., the ability of the objective metric to predict subjective ratings with minimum average error, as in (32), computed from the deviations of the mapped objective and subjective scores about their respective means. A larger value means higher prediction accuracy.
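The Pearson measure of Eq. (32) is the standard sample correlation; a direct sketch:

```python
import numpy as np

def pearson_r(objective, subjective):
    """Eq. (32) sketch: Pearson linear correlation between mapped
    objective scores and DMOS values. Larger |r| means the metric
    tracks subjective ratings more accurately."""
    x = np.asarray(objective, float) - np.mean(objective)
    y = np.asarray(subjective, float) - np.mean(subjective)
    return float((x * y).sum() / np.sqrt((x * x).sum() * (y * y).sum()))
```

Perfectly linear agreement gives r = 1, and a perfectly inverted ranking gives r = -1.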

(b) Spearman rank-order correlation coefficient: this coefficient determines the level of monotonicity by measuring the correlation of the decreasing (or increasing) trends of both variables independently of magnitude, as in (33), where the computation involves the number of data points. A larger value means better prediction performance. (c) Root-Mean-Square Error (RMSE): the square root of the mean squared difference between objective and subjective values, as in (34). A lower value means less deviation between subjective and objective data and better prediction performance.

Fig. 8. Scatter plot of objective score vs. subjective rating for all test cases.

Fig. 9. Scatter plot of objective score vs. subjective rating for test cases ii and vi only.

TABLE I. TEST MATERIAL SELECTION CRITERIA FOR PFIQM PERFORMANCE EVALUATION

TABLE II. QUANTITATIVE PERFORMANCE COMPARISON FOR PFIQM, PSNR, AND SSIM

B. Experimental Results

The data for the objective-versus-subjective comparison are arranged into two groups, where the first group contains all test cases and the second group includes only the test cases with local distortion (i.e., test cases ii and vi in Table I). Figs. 8 and 9 and Table II present scatter plots and quantitative correspondence measurements for these two groups of data. In the analysis of all test cases, both Q and GQ have much higher correlation and lower RMSE than PSNR and SSIM. Hence, the PFIQM significantly outperforms both PSNR and SSIM, the two metrics most commonly used in MCFI research. SSIM is better than PSNR since it is more capable of detecting structural distortion; however, it is still worse than the PFIQM since it does not consider blocking artifacts. A detailed comparison shows that Q performs slightly better than GQ, since GQ does not include local distortion.
In the performance analysis of test cases ii and vi only, all metrics' performance drops dramatically, but Q decreases the least and still maintains very good performance compared with the other metrics. Since GQ focuses on detecting globally spread distortion, its performance is much worse than that of Q. However, since GQ takes blockiness and visual attention factors into account, it still performs better than PSNR and SSIM. It is worth noting that the performance decrease of both PSNR and SSIM is large and similar. According to Figs. 9(c) and 9(d), both PSNR and SSIM

TABLE III. OBJECTIVE SCORES OF SEVERAL SAMPLE IMAGES

Fig. 10. Examples of severe global distortion, where (a) is interpolated by the direct approach and (b) is produced by the multi-stage method [5].

Fig. 11. Examples of mixed global and local quality degradation, where (a) is interpolated by the direct approach and (b) is produced by the multi-stage method [5].

Fig. 12. Examples of the same visual quality but different PSNR values, where (a) is interpolated by the direct approach and (b) is produced by the multi-stage method [5].

Fig. 13. Examples of low global quality degradation but strong local distortion, where (a) is interpolated by the vector median filter [2] and (b) is produced by the multi-stage method [5].

are very insensitive to quality degradation due to local distortion. It is fair to say that these two metrics fail to handle this type of quality impairment. Figs. 10-13 show some interpolated frames produced by different MCFI methods, and Table III shows all metrics' outputs processed by (30) and (31) for each image. In Fig. 10, visual observation gives a consensus that Fig. 10(a) fails to preserve the mouth and nose structure and also suffers severe blockiness, whereas Fig. 10(b) has better quality. The PFIQM and SSIM successfully reflect this difference, but PSNR fails in this case. Fig. 11 shows examples of a mixture of global and local quality impairment. Both samples are highly degraded in the moving regions, but Fig. 11(a) contains more blockiness than Fig. 11(b). Also, Fig. 11(a) has a very noticeable artifact on the foreman's helmet and face. Among the three metrics, both PSNR and SSIM tend to assign higher quality values to Fig. 11(a); only PFIQM's results are consistent with visual observation. Fig. 12 provides an example where Figs. 12(a) and 12(b) have very similar visual quality. The PFIQM assigns similar scores to these two images, but PSNR gives a much higher score to Fig. 12(a).
Finally, Fig. 13 shows an example with weak global distortion but salient local quality impairment. Ignoring the conspicuous blocking artifacts on the helmet edge, the quality of these two images is very close. However, when local distortion is counted, Fig. 13(a) is worse than Fig. 13(b). PFIQM's outputs agree with this visual judgment, but the PSNR scores show a strong contradiction. Not surprisingly, SSIM gives similar scores for both images since it does not consider local distortion. Overall, the high correlation of PFIQM with subjective evaluation has been demonstrated through this exercise. The second best metric is SSIM, and PSNR performs worst in measuring the quality degradation caused by frame interpolation.

V. CONCLUSION

This paper has investigated the quality prediction accuracy of two widely used spatial quality metrics for frame interpolation. Several disadvantages of these metrics are demonstrated, and a new metric, PFIQM, that overcomes these issues is proposed. The metric is designed based on prior knowledge about frame interpolation, such as the types of artifacts, the possible regions of quality degradation, and the occurrence of highly conspicuous local distortion. Performance evaluation shows that PFIQM significantly outperforms the other metrics and is highly consistent with subjective ratings. PFIQM will be further improved by adding more human visual factors, such as texture and temporal information. Moreover, the naturalness of motion in interpolated frames will also be considered.
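The idea of penalizing a frame whose distortion is locally aggregated, rather than thinly spread, can be sketched in simplified form. This is purely illustrative: the linear weighting, the `alpha` value, and the max-based local term are assumptions chosen for exposition, not the paper's actual pooling formulas (30) and (31):

```python
import numpy as np

def pooled_score(block_distortion, attention, alpha=0.7):
    """Illustrative pooling of global and local quality (weights are assumptions).

    block_distortion: per-block artifact severity in [0, 1], higher = worse.
    attention: nonnegative per-block visual-attention weights, same shape.
    """
    w = attention / np.sum(attention)
    global_q = 1.0 - np.sum(w * block_distortion)   # attention-weighted, frame-wide
    local_q = 1.0 - np.max(block_distortion)        # worst locally aggregated distortion
    return alpha * global_q + (1.0 - alpha) * local_q

attention = np.ones(16)            # uniform attention over 16 blocks
thin = np.full(16, 0.1)            # mild distortion spread everywhere
blocky = np.full(16, 0.8 / 15.0)   # same mean distortion ...
blocky[0] = 0.8                    # ... concentrated in one block

# Equal mean distortion, but the concentrated case scores lower.
print(pooled_score(thin, attention), pooled_score(blocky, attention))
```

Unlike plain frame-wide averaging, the local term makes two frames with identical mean distortion separate in rank when one contains a single conspicuous artifact, mirroring the Fig. 13 behavior discussed above.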

REFERENCES

[1] M. Yuen and H. R. Wu, "A survey of hybrid MV/DPCM/DCT video coding artifacts," Signal Processing.
[2] L. Alparone, M. Barni, F. Bartolini, and V. Cappellini, "Adaptively weighted vector-median filters for motion-fields smoothing," in Proc. ICASSP '96, vol. 4, May 1996.
[3] G. Dane and T. Q. Nguyen, "Smooth motion vector resampling for standard compatible video post-processing," in Proc. Asilomar Conf. Signals, Systems and Computers, vol. 2, Nov. 2004.
[4] A.-M. Huang and T. Q. Nguyen, "A novel motion compensated frame interpolation based on block-merging and residual energy," in Proc. IEEE MMSP '06, 2006.
[5] A.-M. Huang and T. Q. Nguyen, "A novel multi-stage motion vector processing method for motion compensated frame interpolation," in Proc. IEEE ICIP '07.
[6] J. Wang, N. Patel, and W. Grosky, "A fast block-based motion compensation video frame interpolation approach," in Proc. Asilomar Conf. Signals, Systems and Computers, 2004.
[7] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Processing, vol. 13, Apr.
[8] H. Sheikh, M. Sabir, and A. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms," IEEE Trans. Image Processing, vol. 15, no. 11, Nov.
[9] K. Seshadrinathan and A. Bovik, "A structural similarity metric for video based on motion models," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pp. I-869-I-872.
[10] Z. Wang, L. Lu, and A. Bovik, "Video quality assessment based on structural distortion measurement," Signal Processing: Image Communication, special issue on objective video quality metrics, vol. 19.
[11] Z. Lu, W. Lin, X. Yang, E. Ong, and S. Yao, "Modeling visual attention's modulatory aftereffects on visual sensitivity and quality evaluation," IEEE Trans. Image Processing, vol. 14, Nov.
[12] M. S. Moore, J. M. Foley, and S. K. Mitra, "Detectability and annoyance value of MPEG-2 artifacts inserted in uncompressed video sequences," Proc. SPIE: Human Vision and Electronic Imaging V, vol. 3959, June.
[13] R. Barland and A. Saadane, "Reference free quality metric using a region-based attention model for JPEG-2000 compressed images," Proc. SPIE: Image Quality and System Performance III, vol. 6059, Jan.
[14] S. Lee, M. Pattichis, and A. Bovik, "Foveated video quality assessment," IEEE Trans. Multimedia, vol. 4, Mar.
[15] Z. Wang, L. Lu, and A. Bovik, "Foveation scalable video coding with automatic fixation selection," IEEE Trans. Image Processing, no. 2, Feb.
[16] S. Winkler, "A perceptual distortion metric for digital color video," Proc. SPIE, vol. 3644, Jan.
[17] M. A. Masry and S. S. Hemami, "A metric for continuous quality evaluation of compressed video with severe distortions," Signal Processing: Image Communication, vol. 19.
[18] J. Lubin, "A visual discrimination model for imaging system design and evaluation," in Vision Models for Target Detection and Recognition.
[19] S. Daly, "The visible differences predictor: An algorithm for the assessment of image fidelity," Proc. SPIE: Human Vision, Visual Processing, and Digital Display III, vol. 1666, pp. 2-15, Jan.
[20] A. B. Watson, "Toward a perceptual video quality metric," Proc. SPIE: Human Vision and Electronic Imaging III, vol. 3299, July.
[21] K. Tan and M. Ghanbari, "A multi-metric objective picture-quality measurement model for MPEG video," IEEE Trans. Circuits and Systems for Video Technology, vol. 10, Oct.
[22] S. Wolf and M. H. Pinson, "Spatial-temporal distortion metrics for in-service quality monitoring of any digital video system," Proc. SPIE, vol. 3845, Sept.
[23] M. Pinson and S. Wolf, "A new standardized method for objectively measuring video quality," IEEE Trans. Broadcasting, no. 3, Sept.
[24] K. Yang, C. Guest, K. El-Maleh, and P. Das, "Perceptual temporal quality metric for compressed video," IEEE Trans. Multimedia, vol. 9, Nov.
[25] H. Wu and M. Yuen, "A generalized block-edge impairment metric for video coding," IEEE Signal Processing Letters, vol. 4, Nov.
[26] S. Winkler, Digital Video Quality: Vision Models and Metrics. John Wiley and Sons, Mar.
[27] S. Winkler, "Perceptual video quality metrics: A review," H. Wu and K. Rao, Eds. Boca Raton, FL: CRC, Nov.
[28] K. Yang, C. Guest, and P. Das, "Human visual attention map for compressed video," in Proc. IEEE International Symposium on Multimedia, Dec. 2006.
[29] U. Rajashekar, I. Linde, A. Bovik, and L. Cormack, "Foveated analysis of image features at fixations," Vision Research, vol. 47, no. 25, Nov. 2007.
[30] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 11, Nov.
[31] X. Yang, W. Lin, Z. Lu, X. Lin, S. Rahardja, E. Ong, and S. Yao, "Rate control for videophone using local perceptual cues," IEEE Trans. Circuits and Systems for Video Technology, no. 4, Apr.
[32] J. McCarthy, M. Sasse, and D. Miras, "Sharp or smooth? Comparing the effects of quantization vs. frame rate for streamed video," in Proc. SIGCHI Conference on Human Factors in Computing Systems, 2004.
[33] R. Li, B. Zeng, and M. L. Liou, "A new three-step search algorithm for block motion estimation," IEEE Trans. Circuits and Systems for Video Technology, vol. 4, no. 4, Aug.
[34] Xiph.org Test Media [Online]. Available: video/derf/
[35] Video Processing Lab, UCSD [Online].
[36] ITU-R Recommendation BT, "Methodology for the Subjective Assessment of the Quality of Television Pictures," International Telecommunication Union.
[37] ITU-R Document 6Q/14, "Final Report from the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment, Phase II (FR-TV2)," 2003. [Online].
Available: vqeg.org

Kai-Chieh Yang received the B.S. degree in mechanical engineering from Yuan-Ze University, Taiwan, and the M.S. and Ph.D. degrees in electrical engineering from the University of California at San Diego (UCSD), La Jolla, in 2004 and 2007, respectively. He is now with Qualcomm Inc., San Diego. His research interests include computer vision, perceptual video coding, and digital video quality assessment.

Ai-Mei Huang received the B.S. degree in communications engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1999, and the M.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan. She is currently a Ph.D. student at the University of California, San Diego, La Jolla, CA. Her research interests include signal and image processing, video coding techniques, and video applications in broadcasting and TV display.

Truong Q. Nguyen received the B.S., M.S., and Ph.D. degrees in electrical engineering from the California Institute of Technology, Pasadena, in 1985, 1986, and 1989, respectively. He was with MIT Lincoln Laboratory from June 1989 to July 1994 as a member of technical staff, and was a visiting lecturer at MIT and an adjunct professor at Northeastern University. From August 1994 to July 1998 he was with the ECE Department, University of Wisconsin, Madison, and he was with Boston University from August 1996. He is currently a Professor in the ECE Department, UCSD.

His research interests are video processing algorithms and their efficient implementation. He is the coauthor (with Prof. Gilbert Strang) of a popular textbook, Wavelets & Filter Banks (Wellesley-Cambridge Press, 1997), and the author of several MATLAB-based toolboxes on image compression, electrocardiogram compression, and filter bank design. He has over 200 publications. Prof. Nguyen received the IEEE Transactions on Signal Processing Paper Award (Image and Multidimensional Processing area) for the paper he co-wrote with Prof. P. P. Vaidyanathan on linear-phase perfect-reconstruction filter banks (1992). He received the NSF CAREER Award in 1995 and is currently the Series Editor (Digital Signal Processing) for Academic Press. He has served as an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and the IEEE TRANSACTIONS ON IMAGE PROCESSING. He is a Fellow of the IEEE.

Clark C. Guest joined the UCSD faculty, where, at the Jacobs School, he runs the Optical Data Processing and Communications Group. He is the undergraduate faculty mentor for the Continuous Observation and Remote sensing AUAV experiment (CORAX), a CalSpace project aimed at enabling students to conduct low-cost space research. He received his Ph.D. from the Georgia Institute of Technology.

Pankaj K. Das came to the Jacobs School from Rensselaer Polytechnic Institute, where he is Professor Emeritus. He has had 40 research contracts over the years from government agencies. Das has authored textbooks on lasers and optical engineering, optical signal processing, and related topics. He received his Ph.D. in electrical engineering from the University of Calcutta in 1964.


More information

Multiframe Blocking-Artifact Reduction for Transform-Coded Video

Multiframe Blocking-Artifact Reduction for Transform-Coded Video 276 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 12, NO. 4, APRIL 2002 Multiframe Blocking-Artifact Reduction for Transform-Coded Video Bahadir K. Gunturk, Yucel Altunbasak, and

More information

A Novel Statistical Distortion Model Based on Mixed Laplacian and Uniform Distribution of Mpeg-4 FGS

A Novel Statistical Distortion Model Based on Mixed Laplacian and Uniform Distribution of Mpeg-4 FGS A Novel Statistical Distortion Model Based on Mixed Laplacian and Uniform Distribution of Mpeg-4 FGS Xie Li and Wenjun Zhang Institute of Image Communication and Information Processing, Shanghai Jiaotong

More information

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding.

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding. Project Title: Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding. Midterm Report CS 584 Multimedia Communications Submitted by: Syed Jawwad Bukhari 2004-03-0028 About

More information

FOR compressed video, due to motion prediction and

FOR compressed video, due to motion prediction and 1390 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 24, NO. 8, AUGUST 2014 Multiple Description Video Coding Based on Human Visual System Characteristics Huihui Bai, Weisi Lin, Senior

More information

DATA hiding [1] and watermarking in digital images

DATA hiding [1] and watermarking in digital images 14 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 6, NO. 1, MARCH 2011 Data Hiding in Motion Vectors of Compressed Video Based on Their Associated Prediction Error Hussein A. Aly, Member,

More information

Depth Estimation for View Synthesis in Multiview Video Coding

Depth Estimation for View Synthesis in Multiview Video Coding MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Depth Estimation for View Synthesis in Multiview Video Coding Serdar Ince, Emin Martinian, Sehoon Yea, Anthony Vetro TR2007-025 June 2007 Abstract

More information

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV Jeffrey S. McVeigh 1 and Siu-Wai Wu 2 1 Carnegie Mellon University Department of Electrical and Computer Engineering

More information

Reducing/eliminating visual artifacts in HEVC by the deblocking filter.

Reducing/eliminating visual artifacts in HEVC by the deblocking filter. 1 Reducing/eliminating visual artifacts in HEVC by the deblocking filter. EE5359 Multimedia Processing Project Proposal Spring 2014 The University of Texas at Arlington Department of Electrical Engineering

More information

Very Low Bit Rate Color Video

Very Low Bit Rate Color Video 1 Very Low Bit Rate Color Video Coding Using Adaptive Subband Vector Quantization with Dynamic Bit Allocation Stathis P. Voukelatos and John J. Soraghan This work was supported by the GEC-Marconi Hirst

More information

Enhanced Hexagon with Early Termination Algorithm for Motion estimation

Enhanced Hexagon with Early Termination Algorithm for Motion estimation Volume No - 5, Issue No - 1, January, 2017 Enhanced Hexagon with Early Termination Algorithm for Motion estimation Neethu Susan Idiculay Assistant Professor, Department of Applied Electronics & Instrumentation,

More information

Video Quality assessment Measure with a Neural Network H. El Khattabi, A. Tamtaoui and D. Aboutajdine

Video Quality assessment Measure with a Neural Network H. El Khattabi, A. Tamtaoui and D. Aboutajdine Video Quality assessment Measure with a Neural Network H. El Khattabi, A. Tamtaoui and D. Aboutajdine Abstract In this paper, we present the video quality measure estimation via a neural network. This

More information

In the name of Allah. the compassionate, the merciful

In the name of Allah. the compassionate, the merciful In the name of Allah the compassionate, the merciful Digital Video Systems S. Kasaei Room: CE 315 Department of Computer Engineering Sharif University of Technology E-Mail: skasaei@sharif.edu Webpage:

More information

Differential Compression and Optimal Caching Methods for Content-Based Image Search Systems

Differential Compression and Optimal Caching Methods for Content-Based Image Search Systems Differential Compression and Optimal Caching Methods for Content-Based Image Search Systems Di Zhong a, Shih-Fu Chang a, John R. Smith b a Department of Electrical Engineering, Columbia University, NY,

More information

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000 Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000 EE5359 Multimedia Processing Project Proposal Spring 2013 The University of Texas at Arlington Department of Electrical

More information

How Many Humans Does it Take to Judge Video Quality?

How Many Humans Does it Take to Judge Video Quality? How Many Humans Does it Take to Judge Video Quality? Bill Reckwerdt, CTO Video Clarity, Inc. Version 1.0 A Video Clarity Case Study page 1 of 5 Abstract for Subjective Video Quality Assessment In order

More information

Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform

Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform Torsten Palfner, Alexander Mali and Erika Müller Institute of Telecommunications and Information Technology, University of

More information

An adaptive preprocessing algorithm for low bitrate video coding *

An adaptive preprocessing algorithm for low bitrate video coding * Li et al. / J Zhejiang Univ SCIENCE A 2006 7(12):2057-2062 2057 Journal of Zhejiang University SCIENCE A ISSN 1009-3095 (Print); ISSN 1862-1775 (Online) www.zju.edu.cn/jzus; www.springerlink.com E-mail:

More information

SCREEN CONTENT IMAGE QUALITY ASSESSMENT USING EDGE MODEL

SCREEN CONTENT IMAGE QUALITY ASSESSMENT USING EDGE MODEL SCREEN CONTENT IMAGE QUALITY ASSESSMENT USING EDGE MODEL Zhangkai Ni 1, Lin Ma, Huanqiang Zeng 1,, Canhui Cai 1, and Kai-Kuang Ma 3 1 School of Information Science and Engineering, Huaqiao University,

More information

Extensions of One-Dimensional Gray-level Nonlinear Image Processing Filters to Three-Dimensional Color Space

Extensions of One-Dimensional Gray-level Nonlinear Image Processing Filters to Three-Dimensional Color Space Extensions of One-Dimensional Gray-level Nonlinear Image Processing Filters to Three-Dimensional Color Space Orlando HERNANDEZ and Richard KNOWLES Department Electrical and Computer Engineering, The College

More information

FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION

FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION Yen-Chieh Wang( 王彥傑 ), Zong-Yi Chen( 陳宗毅 ), Pao-Chi Chang( 張寶基 ) Dept. of Communication Engineering, National Central

More information

Image Quality Assessment Method Based On Statistics of Pixel Value Difference And Local Variance Similarity

Image Quality Assessment Method Based On Statistics of Pixel Value Difference And Local Variance Similarity 212 International Conference on Computer Technology and Science (ICCTS 212) IPCSIT vol. 47 (212) (212) IACSIT Press, Singapore DOI: 1.7763/IPCSIT.212.V47.28 Image Quality Assessment Method Based On Statistics

More information

OBJECTIVE IMAGE QUALITY ASSESSMENT WITH SINGULAR VALUE DECOMPOSITION. Manish Narwaria and Weisi Lin

OBJECTIVE IMAGE QUALITY ASSESSMENT WITH SINGULAR VALUE DECOMPOSITION. Manish Narwaria and Weisi Lin OBJECTIVE IMAGE UALITY ASSESSMENT WITH SINGULAR VALUE DECOMPOSITION Manish Narwaria and Weisi Lin School of Computer Engineering, Nanyang Technological University, Singapore, 639798 Email: {mani008, wslin}@ntu.edu.sg

More information

Compression of VQM Features for Low Bit-Rate Video Quality Monitoring

Compression of VQM Features for Low Bit-Rate Video Quality Monitoring Compression of VQM Features for Low Bit-Rate Video Quality Monitoring Mina Makar, Yao-Chung Lin, Andre F. de Araujo and Bernd Girod Information Systems Laboratory, Stanford University, Stanford, CA 9435

More information

CS 260: Seminar in Computer Science: Multimedia Networking

CS 260: Seminar in Computer Science: Multimedia Networking CS 260: Seminar in Computer Science: Multimedia Networking Jiasi Chen Lectures: MWF 4:10-5pm in CHASS http://www.cs.ucr.edu/~jiasi/teaching/cs260_spring17/ Multimedia is User perception Content creation

More information

Structural Similarity Based Image Quality Assessment Using Full Reference Method

Structural Similarity Based Image Quality Assessment Using Full Reference Method From the SelectedWorks of Innovative Research Publications IRP India Spring April 1, 2015 Structural Similarity Based Image Quality Assessment Using Full Reference Method Innovative Research Publications,

More information

Digital Video Processing

Digital Video Processing Video signal is basically any sequence of time varying images. In a digital video, the picture information is digitized both spatially and temporally and the resultant pixel intensities are quantized.

More information

New Directions in Image and Video Quality Assessment

New Directions in Image and Video Quality Assessment New Directions in Image and Video Quality Assessment Al Bovik Laboratory for Image & Video Engineering (LIVE) The University of Texas at Austin bovik@ece.utexas.edu October 2, 2007 Prologue I seek analogies

More information

Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block Transform

Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block Transform Circuits and Systems, 2010, 1, 12-17 doi:10.4236/cs.2010.11003 Published Online July 2010 (http://www.scirp.org/journal/cs) Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block

More information

Advanced Video Coding: The new H.264 video compression standard

Advanced Video Coding: The new H.264 video compression standard Advanced Video Coding: The new H.264 video compression standard August 2003 1. Introduction Video compression ( video coding ), the process of compressing moving images to save storage space and transmission

More information

Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations

Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations Prashant Ramanathan and Bernd Girod Department of Electrical Engineering Stanford University Stanford CA 945

More information

An Optimized Template Matching Approach to Intra Coding in Video/Image Compression

An Optimized Template Matching Approach to Intra Coding in Video/Image Compression An Optimized Template Matching Approach to Intra Coding in Video/Image Compression Hui Su, Jingning Han, and Yaowu Xu Chrome Media, Google Inc., 1950 Charleston Road, Mountain View, CA 94043 ABSTRACT The

More information

Motion Estimation for Video Coding Standards

Motion Estimation for Video Coding Standards Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression

More information

REGULARIZATION PARAMETER TRIMMING FOR ITERATIVE IMAGE RECONSTRUCTION

REGULARIZATION PARAMETER TRIMMING FOR ITERATIVE IMAGE RECONSTRUCTION REGULARIZATION PARAMETER TRIMMING FOR ITERATIVE IMAGE RECONSTRUCTION Haoyi Liang and Daniel S. Weller University of Virginia, Department of ECE, Charlottesville, VA, 2294, USA ABSTRACT Conventional automatic

More information

Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations

Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations Prashant Ramanathan and Bernd Girod Department of Electrical Engineering Stanford University Stanford CA 945

More information

NTHU Rain Removal Project

NTHU Rain Removal Project People NTHU Rain Removal Project Networked Video Lab, National Tsing Hua University, Hsinchu, Taiwan Li-Wei Kang, Institute of Information Science, Academia Sinica, Taipei, Taiwan Chia-Wen Lin *, Department

More information

Image Gap Interpolation for Color Images Using Discrete Cosine Transform

Image Gap Interpolation for Color Images Using Discrete Cosine Transform Image Gap Interpolation for Color Images Using Discrete Cosine Transform Viji M M, Prof. Ujwal Harode Electronics Dept., Pillai College of Engineering, Navi Mumbai, India Email address: vijisubhash10[at]gmail.com

More information

EE Low Complexity H.264 encoder for mobile applications

EE Low Complexity H.264 encoder for mobile applications EE 5359 Low Complexity H.264 encoder for mobile applications Thejaswini Purushotham Student I.D.: 1000-616 811 Date: February 18,2010 Objective The objective of the project is to implement a low-complexity

More information

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing ĐẠI HỌC QUỐC GIA TP.HỒ CHÍ MINH TRƯỜNG ĐẠI HỌC BÁCH KHOA KHOA ĐIỆN-ĐIỆN TỬ BỘ MÔN KỸ THUẬT ĐIỆN TỬ VIDEO AND IMAGE PROCESSING USING DSP AND PFGA Chapter 3: Video Processing 3.1 Video Formats 3.2 Video

More information

ANALYSIS AND COMPARISON OF SYMMETRY BASED LOSSLESS AND PERCEPTUALLY LOSSLESS ALGORITHMS FOR VOLUMETRIC COMPRESSION OF MEDICAL IMAGES

ANALYSIS AND COMPARISON OF SYMMETRY BASED LOSSLESS AND PERCEPTUALLY LOSSLESS ALGORITHMS FOR VOLUMETRIC COMPRESSION OF MEDICAL IMAGES JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 24/2015, ISSN 1642-6037 Bilateral symmetry, Human visual system, MRI and CT images, Just noticeable distortion, Perceptually lossless compression Chandrika

More information

SPATIO-TEMPORAL SSIM INDEX FOR VIDEO QUALITY ASSESSMENT

SPATIO-TEMPORAL SSIM INDEX FOR VIDEO QUALITY ASSESSMENT SPATIO-TEMPORAL SSIM INDEX FOR VIDEO QUALITY ASSESSMENT Yue Wang 1,Tingting Jiang 2, Siwei Ma 2, Wen Gao 2 1 Graduate University of Chinese Academy of Sciences, Beijing, China 2 National Engineering Lab

More information