Modeling Rate and Perceptual Quality of Scalable Video as Functions of Quantization and Frame Rate and Its Application in Scalable Video Adaptation



(Invited Paper)

Yao Wang, Zhan Ma, Yen-Fu Ou
Dept. of Electrical and Computer Engineering, Polytechnic Institute of NYU, Brooklyn, NY 11201, U.S.A. {zma3,

Abstract: This paper investigates the impact of frame rate and quantization on the bit rate and perceptual quality of a scalable video with temporal and quality scalability. We propose a rate model and a quality model, both in terms of the quantization stepsize and frame rate. The quality model is derived from our earlier quality model in terms of the PSNR of decoded frames and frame rate. Both models are developed based on the key observation from experimental data that the relative reduction of either rate or quality when the frame rate decreases is largely independent of the quantization stepsize. This observation enables us to express both rate and quality as the product of separate functions of quantization stepsize and frame rate, respectively. The proposed rate and quality models are analytically tractable, each requiring only two content-dependent parameters. Both models fit the measured data very accurately, with high Pearson correlation. We further apply these models to rate-constrained bitstream adaptation, where the problem is to determine the optimal combination of quality and temporal layers that provides the highest perceptual quality for a given bandwidth constraint.

Index Terms: Rate prediction, video quality metric, scalable video adaptation, scalable video coding (SVC)

I. INTRODUCTION

Scalable video coding (SVC) refers to coding a video into an embedded bit stream that has a high quality when completely decoded, and a lower quality when the bit stream is truncated.
When a video is coded into a scalable stream with spatial, temporal, and amplitude scalability, the same video content may be delivered with varying frame rate, frame size, or quantization stepsize, depending on the sustainable transmission rate, display resolution, and battery status (for battery-powered devices) at the receiver. (Amplitude scalability as defined here is conventionally known as quality or SNR scalability. To avoid confusion with the overall perceptual quality of a video at different resolutions, we use the term amplitude scalability.) Scalable video is particularly attractive for video multicast, where receivers of the same video often have different sustainable transmission rates with the server and varying decoding and display capabilities. Even for unicast, SVC allows the server to store just one bitstream, but send different portions of the stream to receivers with different bandwidth and energy resources.

To deliver a pre-coded scalable bitstream to heterogeneous receivers with varying bandwidth constraints, either the sender or a transcoder at a proxy needs to extract from the original bitstream certain spatial, temporal, and amplitude layers to meet the bandwidth constraint of a particular receiver (or a group of receivers with similar rate constraints). This problem is generally known as rate-constrained bit stream adaptation. For a given target bit rate, one may choose to extract the layers leading to a high frame rate and large frame size but low quality in each decoded frame (noticeable coding artifacts), or a low frame rate and small frame size but high quality per frame, or other combinations of spatial, temporal, and amplitude resolutions. Different combinations are likely to yield different perceptual quality. A major challenge in deploying scalable video lies in how to perform the adaptation efficiently while maximizing the perceptual quality.
The latest scalable video coding (SVC) standard [1] enables lightweight bitstream manipulation [2] and can also provide state-of-the-art coding performance [3], through its network-friendly interface design and the efficient compression schemes inherited from H.264/AVC [4]. However, before SVC video can be widely deployed in practical applications, efficient mechanisms for adapting SVC streams to different user constraints need to be developed. Optimal adaptation requires accurate prediction of the perceived quality as well as the total rate at any combination of spatial, temporal, and amplitude (STA) resolutions. Although much work has been done on perceptual quality modeling and on rate modeling for single-layer video or video with amplitude scalability only, the impact of spatial and temporal resolutions, together with amplitude resolution, individually and jointly, on the perceptual quality and rate has not been studied extensively. Recently, several studies have examined the influence of spatial, temporal, and amplitude resolutions, individually or jointly, on the perceptual quality [5], [6], [7], [8]. However, some of these models require many parameters or have limited accuracy. To the best of our knowledge, none of the prior work in scalable video adaptation has attempted to predict the rates corresponding to different layer combinations. Rather, these studies make use of the actual rates associated

with different layers. Without analytical rate models, the optimal layer combination has to be found through exhaustive search, to see which combination leads to the highest rate-quality slope while meeting the rate constraint. In certain applications, the adaptation decision needs to be made at the receiver and fed back to the server. In such situations, the rates associated with all possible layer combinations have to be delivered to the receiver, requiring extra bandwidth and delay. Having an accurate rate model, together with an accurate quality model, would enable one to determine the optimal STA combination for a given rate constraint efficiently.

In this paper, we focus on modeling the impact of temporal and amplitude resolutions (in terms of frame rate and quantization stepsize, respectively) on both rate and quality. We further apply these models to solve the rate-constrained SVC adaptation problem, assuming the spatial resolution is determined based on other considerations (e.g., the display size of the receiver). We defer the consideration of the spatial resolution to future study.

Our quality model relates the perceptual quality to the quantization stepsize and frame rate. It is derived from our prior work, which uses the product of two terms: a metric that assesses the quality of a quantized video at the highest frame rate, based on the PSNR of decoded frames, and a temporal correction factor for quality (TCFQ), which reduces the quality assigned by the first metric according to the actual frame rate. In the quality model proposed here, we replace the first term by a metric that relates the quality of the highest-frame-rate video to the quantization stepsize. Each term has a single parameter, and the overall model is shown to fit the subjective ratings very well, with an average Pearson correlation of 0.984 over four test sequences. Our rate model predicts the rate from the quantization stepsize and frame rate.
It also uses the product of two terms: a metric that describes how the rate changes with the quantization stepsize when the video is coded at the highest frame rate, and a temporal correction factor for rate (TCFR), which corrects the rate predicted by the first metric based on the actual frame rate. As with the quality model, it has only two parameters and fits the measured rates of decoded SVC video from different temporal and amplitude layers very accurately (with an average Pearson correlation of 0.998 over four sequences).

In the remainder of this paper, we present the proposed rate model in Sec. II and the quality model in Sec. III. Using these two models, we address the problem of rate-constrained bit stream adaptation in Sec. IV. Sec. V concludes the paper.

II. RATE MODEL

In this section, we develop a rate model $R(q, t)$, which relates the rate $R$ to the quantization stepsize $q$ and frame rate $t$. To the best of our knowledge, no prior work has considered the joint impact of frame rate and quantization on the bit rate. However, several prior works have considered rate modeling for non-scalable video, and have proposed models that relate the average bit rate to the quantization stepsize. Ding and Liu reported the following model [9],

$$R = \frac{\theta}{q^{\gamma}}, \qquad (1)$$

where $\theta$ and $\gamma$ are model parameters, with $0 \le \gamma \le 2$. Chiang and Zhang [10] suggested the following model,

$$R = \frac{A_1}{q} + \frac{A_2}{q^2}. \qquad (2)$$

This so-called quadratic rate model has been used for rate control in the MPEG-4 reference encoder [11]. We note that by choosing $A_1$ and $A_2$ appropriately, the model in (2) can realize the inverse power model of (1) with any $\gamma \in (1, 2)$. Only the quadratic term was included in the model by Ribas-Corbera and Lei [12], i.e.,

$$R = \frac{A}{q^2}. \qquad (3)$$

More recently, He [13] proposed the $\rho$-model,

$$R(QP) = \theta \left(1 - \rho(QP)\right), \qquad (4)$$

with $\rho$ denoting the percentage of zero quantized transform coefficients for a given quantization parameter. This model has been shown to have high accuracy for rate prediction.
A problem with the $\rho$-model is that it does not provide an explicit relation between QP and $\rho$. Therefore, it does not lend itself to a theoretical understanding of the impact of QP on the rate.

In our work on rate modeling, we focus on the impact of frame rate $t$ on the bit rate $R$ under the same quantization stepsize $q$, while using prior models to characterize the impact of $q$ on the rate when the video is coded at a fixed frame rate. Towards this goal, we recognize that $R(q, t)$ can be written as

$$R(q, t) = R_{\max}\, R_q(q; t_{\max})\, R_t(t; q), \qquad (5)$$

where $R_{\max} = R(q_{\min}, t_{\max})$ is the maximum bit rate, obtained with a chosen minimal quantization stepsize $q_{\min}$ and a chosen maximum frame rate $t_{\max}$; $R_q(q; t_{\max}) = R(q, t_{\max}) / R(q_{\min}, t_{\max})$ is the normalized rate vs. quantization stepsize (NRQ) under the maximum frame rate $t_{\max}$; and $R_t(t; q) = R(q, t) / R(q, t_{\max})$ is the normalized rate vs. temporal resolution (NRT) under the same quantization stepsize $q$. Note that the NRQ function $R_q(q; t_{\max})$ describes how the rate decreases when the quantization stepsize increases beyond $q_{\min}$ under the frame rate $t_{\max}$, while the NRT function $R_t(t; q)$ characterizes how the rate reduces when the frame rate decreases from $t_{\max}$ under the same quantization stepsize $q$. We also call $R_t(t; q)$ the temporal correction factor for rate (TCFR), as it describes how to correct the rate estimate $R_{\max} R_q(q; t_{\max})$ based on the actual temporal resolution. As will be shown later by experimental data, the impact of $q$ and $t$ on the bit rate is actually separable, so that $R_t(t; q)$ can be represented by a

function of $t$ only, denoted by $R_t(t)$, and $R_q(q; t)$ by a function of $q$ only, denoted by $R_q(q)$.

[Fig. 1. Normalized rate vs. temporal resolution (NRT) using different quantization stepsizes ($q$). Points are measured rates; curves are rates predicted by the model of Eq. (6).]

To see how quantization and frame rate respectively influence the bit rate, we encoded several test videos using the SVC reference software JSVM92 [14] and measured the actual bit rates corresponding to different $q$ and $t$. Specifically, four video sequences, Akiyo, City, Crew, and Football, all in CIF (352 × 288) resolution, are encoded into 5 temporal layers using a dyadic hierarchical prediction structure, with frame rates of 1.875, 3.75, 7.5, 15, and 30 Hz, respectively, and each temporal layer contains 5 CGS layers obtained with quantization parameters (QP) of 44, 40, 36, 32, and 28. Using the H.264 mapping between $q$ and QP, $q = 2^{(QP - 4)/6}$, the corresponding quantization stepsizes are 104, 64, 40, 26, and 16, respectively. The bit rates of all layers are collected and normalized by the rate at the highest frame rate, i.e., $t_{\max} = 30$ Hz, to find the NRT points $R_t(t; q) = R(q, t) / R(q, t_{\max})$ for all $t$ and $q$ considered, which are plotted in Fig. 1. As shown in Fig. 1, the NRT curves obtained with different quantization stepsizes overlap with each other, and can be captured by a single curve quite well. Similarly, the NRQ curves $R_q(q; t) = R(q, t) / R(q_{\min}, t)$ for different frame rates $t$ are also almost invariant with the frame rate $t$, as shown in Fig. 2. These observations suggest that the effects of $q$ and $t$ on the bit rate are separable, i.e., the quantization-induced rate variation is independent of the frame rate and vice versa. Therefore, the overall rate modeling problem is divided into two parts: one is to devise an appropriate functional form for $R_t(t)$, so that it can model the measured NRT points for all $q$ in Fig. 1
accurately; the other is to derive an appropriate functional form for $R_q(q)$ that can accurately model the measured NRQ points in Fig. 2 for $t = t_{\max}$. Note that, in fact, the $R_q(q)$ model fits the NRQ points obtained at all different $t$. We also note that, different from the JSVM default configuration, which uses different QPs for different temporal layers, the same QP is applied to all temporal layers at each CGS layer in our experiments.

We assume that $R_{\max}$ can be easily measured by coding a video at the chosen $q_{\min}$ and $t_{\max}$. Generally, for given $q_{\min}$ and $t_{\max}$, $R_{\max}$ depends on the video content. Modeling the relation of $R_{\max}$ to the video content is beyond the scope of this paper. The derivation of the models $R_q(q)$ and $R_t(t)$ is explained in detail as follows.

A. Model for the Temporal Correction Factor for Rate (TCFR) $R_t(t)$

As explained earlier, $R_t(t)$ describes the reduction of the normalized bit rate as the frame rate reduces. Therefore, the desired property for the $R_t(t)$ function is that it should be 1 at $t = t_{\max}$ and monotonically reduce to 0 at $t = 0$. Based on the measurement data in Fig. 1, we choose a power function, i.e.,

$$R_t(t) = \left(\frac{t}{t_{\max}}\right)^{b}. \qquad (6)$$

Figure 1 shows the model curve using this function along with the measured data. The parameter $b$ is obtained by minimizing the squared error between the modeled rates and measured rates. It can be seen that the model fits the measured data points very well. We also tried some other functional forms, including logarithmic and inverse falling exponential, and found that the power function yields the least fitting error.

B. Model for Normalized Rate vs. Quantization (NRQ) $R_q(q)$

Analogous to the $R_t(t)$ function, $R_q(q)$ describes the reduction of the normalized bit rate as the quantization stepsize increases at a fixed frame rate. The desired property for the $R_q(q)$ function is that it should be 1 at $q = q_{\min}$ and monotonically reduce to 0 as $q$ goes to infinity. Based on the measurement data in Fig. 2, we choose an inverse power function, i.e.,

$$R_q(q) = \left(\frac{q}{q_{\min}}\right)^{-a}. \qquad (7)$$

Figure 2 shows the model curve using this function along with the measured data. It can be seen that the model fits the measured data points very well. The parameter $a$ characterizes how fast the bit rate reduces when $q$ increases. Interestingly, all four test sequences have very similar $a$ values. We also tried some other functional forms, including falling exponential, and found that the inverse power function yields the least fitting error. We note that the model in (7) is consistent with the model proposed by Ding and Liu [9], i.e., Eq. (1), for non-scalable video, where they found the exponent to be in the range of 0 to 2.

C. The Overall Rate Model

Combining Eqs. (5), (6), and (7), we propose the following rate model,

$$R(q, t) = R_{\max} \left(\frac{q}{q_{\min}}\right)^{-a} \left(\frac{t}{t_{\max}}\right)^{b}, \qquad (8)$$

where $q_{\min}$ and $t_{\max}$ should be chosen based on the underlying application, $R_{\max}$ is the actual rate when coding a video at $(q_{\min}, t_{\max})$, and $a$ and $b$ are the model parameters. The actual rate data of all test sequences with different combinations of $q$ and $t$, and the corresponding rates estimated via the proposed model (8), are illustrated in Fig. 3. The model predictions fit the experimental rate points very well.

[Fig. 2. Normalized rate vs. quantization stepsize (NRQ) using different frame rates $t$ (1.875, 3.75, 7.5, 15, and 30 Hz). Points are measured rates; curves are rates predicted by the model of Eq. (7).]

[Fig. 3. Experimental rate points and predicted rates using the rate model (8). Vertical axis: bit rate (kbps).]

The model parameters, $a$ and $b$, are obtained by minimizing the root mean squared error (RMSE) between the measured and predicted rates corresponding to all $q$ and $t$. Table I lists the parameter values. Also listed are the fitting error in terms of the relative RMSE, i.e., RMSE/$R_{\max}$, and the Pearson correlation (PC) between measured and predicted rates, defined as

$$r_{xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\, \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}, \qquad (9)$$

where $x_i$ and $y_i$ are the measured and predicted rates, and $n$ is the total number of available samples. We see that the model is very accurate for all four sequences, with very small relative RMSE and very high PC.

[TABLE I. Parameters for the rate model and model accuracy. Rows: $a$, $b$, RMSE/$R_{\max}$, PC. Surviving RMSE/$R_{\max}$ entries: .54%, .67%, .25%, .54%.]

Note that parameter $a$ characterizes how fast the bit rate reduces when $q$ increases. A larger $a$ indicates a faster drop. Interestingly, all four test sequences have quite similar $a$ values.
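The rate model in Eq. (8), the QP-to-stepsize mapping, and the Pearson correlation in Eq. (9) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are ours, `R_MAX`, `TRUE_A`, and `TRUE_B` are made-up values, and the "measurements" are synthetic points generated from the model itself rather than the paper's data. The fit is done by ordinary least squares on the logarithm of Eq. (8), which is a simplification of the linear-domain RMSE minimization described in the text.

```python
import math

def qp_to_stepsize(qp):
    """Closed-form mapping q = 2^((QP - 4) / 6) quoted in the text.
    It is exact for QP 28 (16) and QP 40 (64); for QP 44 it gives about
    101.6, whereas the text lists the stepsize 104."""
    return 2.0 ** ((qp - 4) / 6.0)

def rate_model(q, t, r_max, a, b, q_min, t_max):
    """Eq. (8): R(q, t) = R_max * (q / q_min)^(-a) * (t / t_max)^b."""
    return r_max * (q / q_min) ** (-a) * (t / t_max) ** b

def fit_rate_params(samples, r_max, q_min, t_max):
    """Estimate (a, b) from (q, t, rate) samples.

    Simplification (not the paper's procedure): least squares on
        ln(R / R_max) = a * x + b * y,
    with x = -ln(q / q_min) and y = ln(t / t_max).
    """
    sxx = syy = sxy = sxz = syz = 0.0
    for q, t, rate in samples:
        x = -math.log(q / q_min)
        y = math.log(t / t_max)
        z = math.log(rate / r_max)
        sxx += x * x
        syy += y * y
        sxy += x * y
        sxz += x * z
        syz += y * z
    det = sxx * syy - sxy * sxy          # 2x2 normal equations
    a = (sxz * syy - sxy * syz) / det
    b = (sxx * syz - sxy * sxz) / det
    return a, b

def pearson(xs, ys):
    """Eq. (9): Pearson correlation between two equally long sequences."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy))

# Synthetic demonstration; every number below is an assumption.
Q_MIN, T_MAX, R_MAX = 16.0, 30.0, 1000.0
TRUE_A, TRUE_B = 1.2, 0.6
STEPSIZES = [16.0, 26.0, 40.0, 64.0, 104.0]
FRAME_RATES = [1.875, 3.75, 7.5, 15.0, 30.0]

samples = [(q, t, rate_model(q, t, R_MAX, TRUE_A, TRUE_B, Q_MIN, T_MAX))
           for q in STEPSIZES for t in FRAME_RATES]
a_hat, b_hat = fit_rate_params(samples, R_MAX, Q_MIN, T_MAX)
measured = [r for _, _, r in samples]
predicted = [rate_model(q, t, R_MAX, a_hat, b_hat, Q_MIN, T_MAX)
             for q, t, _ in samples]
pc = pearson(measured, predicted)
```

With real layer rates extracted from a coded stream, `samples` would hold the measured (q, t, rate) triples instead of the synthetic ones; a log-domain fit weights low-rate points more heavily than a linear-domain fit, so the resulting parameters may differ slightly from those obtained by minimizing the RMSE directly.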
The similarity of $a$ across the four sequences implies that $a$ is almost independent of video content. When we set $a = 1.2$ for all four sequences, we also get quite accurate rate prediction. In practice, in order to avoid the estimation or specification of the parameter $a$, it may be preferable to use a fixed value for $a$. Parameter $b$ indicates how fast the rate drops when the frame rate decreases, with a larger $b$ indicating a faster drop. As expected, the Football sequence, which has higher motion, has a larger $b$, and Akiyo has the smallest $b$.

In scalable video adaptation, where a full-resolution scalable stream is already generated, one can easily derive the model parameters from the rates corresponding to several different $(t, q)$ combinations using least squares fitting. In applications requiring estimation of model parameters from the original video sequence (e.g., for encoder optimization), it will be important to characterize the relation between $a$, $b$ and some content features. Study of the relation between the model parameters and video content will be a subject of our future research.

III. QUALITY MODEL

There are several published works examining the impact of either frame rate alone or both frame rate and quantization artifacts on the perceptual quality. Quan and Ghanbari [7] consider the impact of both regular and irregular frame drops and examine the jerkiness and jitter effects caused by different levels of strength, duration, and distribution of the temporal impairment. Besides the study of frame rate impact on perceptual quality, Feghali et al. proposed a video quality metric [6], [8] that considers both frame rate and quantization effects. Their metric uses a weighted sum of two terms: one is the PSNR of the sequences interpolated from the original low-frame-rate video, and the other is the frame-rate reduction. The weight depends on the motion attributes of the sequences. The work in [5] extended that of [8] by employing a different motion feature in the weight. The authors of [15] propose to use

computational models, which emulate human visual perception based on block fidelity, content richness, spatial-textural, color, and temporal masking. Although the model has been shown to correlate well with subjective quality, it requires significant computation.

Our quality model is extended from our earlier work [16]. As with the rate model, we focus on examining the impact of frame rate on the quality under the same quantization stepsize, while trying to use prior models to characterize the impact of quantization stepsize on the quality when the video is coded at a fixed frame rate. The proposed model is written generally as

$$Q(q, t) = Q_{\max}\, Q_q(q; t_{\max})\, Q_t(t; q), \qquad (10)$$

where $Q_{\max} = Q(q_{\min}, t_{\max})$; $Q_q(q; t_{\max}) = Q(q, t_{\max}) / Q(q_{\min}, t_{\max})$ is the normalized quality vs. quantization stepsize (NQQ) under the maximum frame rate $t_{\max}$; and $Q_t(t; q) = Q(q, t) / Q(q, t_{\max})$ is the normalized quality vs. temporal resolution (NQT) under the same quantization stepsize $q$. Note that $Q_{\max} Q_q(q; t_{\max})$ models the impact of quantization on the quality when the video is coded at the highest frame rate $t_{\max}$, while $Q_t(t; q)$ describes how the quality reduces when the frame rate reduces, under the same $q$. In other words, $Q_t(t; q)$ corrects the quality predicted by $Q_{\max} Q_q(q; t_{\max})$ based on the actual frame rate, and for this reason is also called the Temporal Correction Factor for Quality (TCFQ).

To derive the appropriate functional forms for $Q_q(q; t_{\max})$ and $Q_t(t; q)$, we conducted subjective tests to obtain mean opinion scores (MOS) for the same test sequences used for deriving the rate model. The subjective tests were performed only for 64 decoded sequences, at frame rates of 30, 15, 7.5, and 3.75 Hz, and QP equal to 28, 36, 40, and 44 (corresponding to quantization stepsizes of 16, 40, 64, and 104, respectively). The subjective quality assessment is carried out using a protocol similar to ACR-HR (Absolute Category Rating with Hidden Reference) described in [17].
In the test, a subject is shown one video at a time, and provides an overall rating after each clip is played completely. The rating scale ranges from 0 (worst) to 100 (best). There are on average 2 ratings for each processed video sequence. Details about the subjective tests can be found in [16].

To see how the normalized quality ratings $Q_q(q; t)$ and $Q_t(t; q)$ vary with $q$ and $t$, respectively, Figures 4 and 5 show the measured data from our subjective tests. Unlike the rate data, where the effects of quantization stepsize $q$ and frame rate $t$ are quite separable, there are noticeable interactions between $t$ and $q$ in their impact on the perceptual quality. This interaction is in fact well known, but not well understood. However, as seen in Fig. 4, the effect of $q$ on the NQT curves $Q_t(t; q)$ is inconsistent and relatively small. These variations may also be due in part to viewer inconsistency during the subjective tests. To reduce the model complexity, we choose to model the $Q_t(t; q)$ curves by a function of $t$ only, denoted by $Q_t(t)$. For the model of $Q_q(q; t_{\max})$, we use only the measured NQQ data at the frame rate $t_{\max}$.

In [16], we used the inverted exponential function for the NQT function, i.e.,

$$Q_t(t) = \frac{1 - e^{-d\, t / t_{\max}}}{1 - e^{-d}}. \qquad (11)$$

The model curve is shown along with the measured NQT points in Fig. 4. We see that the model is quite accurate.

[Fig. 4. Normalized quality (normalized MOS) against frame rate, for different quantization stepsizes $q$. Panels: Akiyo, City, Crew, Football. Points are measured data; the curve is based on the model in Eq. (11).]

To model the variation of the perceptual quality with quantization when the video is coded at a fixed frame rate $t_{\max}$, our earlier work [16] assumed that, under the same quantization parameter, the PSNR of decoded frames at frame rate $t_{\max}$ would be similar to the PSNR of decoded frames at a reduced frame rate $t$.
We therefore used the PSNR computed at frame rate $t$ to estimate the quality of the video coded at $t_{\max}$. Based on the prior work in [18], we used a sigmoidal function with two parameters to relate the PSNR to the perceptual quality. In the current work, based on the measured NQQ points $Q_q(q; t_{\max})$ shown in Fig. 5, we propose to use an exponential function to capture the quality variation with $q$ at the highest frame rate $t_{\max}$, i.e.,

$$Q_q(q) = \frac{e^{-c\, q / q_{\min}}}{e^{-c}}, \qquad (12)$$

with $c$ as the model parameter. Compared with the original two-parameter sigmoid function proposed in [16], the single-parameter exponential function is simpler and easier to analyze. Comparing the measured and predicted quality ratings shown in Fig. 5, we see that the model captures the quantization-induced quality variation very well at the highest frame rate.
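As a quick consistency check on the two normalized factors (a worked evaluation added here for illustration, not part of the original derivation), both functions equal 1 at the reference point, and the limiting behavior of the temporal factor in $d$ shows how that parameter controls the steepness of the quality drop:

```latex
\begin{align*}
Q_t(t_{\max}) &= \frac{1 - e^{-d}}{1 - e^{-d}} = 1, &
Q_q(q_{\min}) &= \frac{e^{-c}}{e^{-c}} = 1, \\
\lim_{d \to 0} Q_t(t) &= \lim_{d \to 0} \frac{1 - e^{-d\,t/t_{\max}}}{1 - e^{-d}} = \frac{t}{t_{\max}}, &
\lim_{d \to \infty} Q_t(t) &= 1 \quad (t > 0).
\end{align*}
```

So for small $d$ the normalized quality falls almost linearly with frame rate, while for large $d$ it stays near 1 until the frame rate becomes very low.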

[Fig. 5. Normalized quality (normalized MOS) versus the quantization stepsize for different frame rates $t$. Panels: Akiyo, City, Crew, Football. Points are measured data and the curve is the predicted quality for $t = t_{\max} = 30$ Hz, using Eq. (12).]

Combining Eqs. (10), (11), and (12), the overall video quality model can be expressed as

$$Q(q, t) = Q_{\max}\, \frac{e^{-c\, q / q_{\min}}}{e^{-c}}\, \frac{1 - e^{-d\, t / t_{\max}}}{1 - e^{-d}}. \qquad (13)$$

Note that $Q_{\max}$ is the MOS given for the video at $q_{\min}$ and $t_{\max}$. Generally, this value can be estimated by some preliminary subjective tests. In our subjective tests, the ratings are given in the range of 0 to 100. But the viewers seldom give a rating of 100, even for very high quality video, as is commonly observed in subjective tests. What is surprising and fortunate is that the MOS values for the videos coded at $q_{\min}$ and $t_{\max}$ are very close to each other for all four test sequences, about 89, as shown in Fig. 7. Therefore, we set $Q_{\max}$ to 89 in our model. Note that on the more common MOS scale of 1 to 5, 89 out of 0 to 100 would correspond to a MOS of $0.89 \times 4 + 1 = 4.56$.

Figure 6 compares the measured quality ratings with those predicted by the model in (13). The parameters $c$ and $d$ are obtained by least squares fitting. Table II summarizes the parameters and the model accuracy in terms of RMSE and Pearson correlation (PC) values for the four sequences. Overall, the proposed model, with only two content-dependent parameters, predicts the MOS very well for the sequences Akiyo and Crew, with a very high PC (> 0.99). The model is less accurate for Football and City, but still has a quite high PC. We would like to point out that the measured MOS data for these two sequences do not follow a consistent trend at some quantization levels, which may be due to the limited number of viewers participating in the subjective tests.
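A minimal Python sketch of the quality model in Eq. (13) and of the scale conversion mentioned above follows. The function names and the values chosen for $c$ and $d$ are hypothetical placeholders for illustration (the fitted values belong in Table II); only $Q_{\max} = 89$, $q_{\min} = 16$, and $t_{\max} = 30$ Hz are taken from the text.

```python
import math

def tcfq(t, t_max, d):
    """Eq. (11): normalized quality vs. frame rate; equals 1 at t = t_max."""
    return (1.0 - math.exp(-d * t / t_max)) / (1.0 - math.exp(-d))

def nqq(q, q_min, c):
    """Eq. (12): normalized quality vs. stepsize; equals 1 at q = q_min."""
    return math.exp(-c * q / q_min) / math.exp(-c)

def quality_model(q, t, q_max_mos, c, d, q_min, t_max):
    """Eq. (13): Q(q, t) = Q_max * Q_q(q) * Q_t(t)."""
    return q_max_mos * nqq(q, q_min, c) * tcfq(t, t_max, d)

def to_five_point(mos_100):
    """Linear map from the 0-100 rating scale to the 1-5 MOS scale."""
    return mos_100 / 100.0 * 4.0 + 1.0

Q_MAX_MOS = 89.0            # value adopted in the text
Q_MIN, T_MAX = 16.0, 30.0   # smallest stepsize and highest frame rate tested
C, D = 0.1, 5.0             # hypothetical parameters, NOT fitted values

peak = quality_model(Q_MIN, T_MAX, Q_MAX_MOS, C, D, Q_MIN, T_MAX)
coarser = quality_model(40.0, T_MAX, Q_MAX_MOS, C, D, Q_MIN, T_MAX)
slower = quality_model(Q_MIN, 15.0, Q_MAX_MOS, C, D, Q_MIN, T_MAX)
five_point = to_five_point(Q_MAX_MOS)
```

By construction the model returns $Q_{\max}$ at $(q_{\min}, t_{\max})$ and a lower value for any coarser stepsize or lower frame rate, and `to_five_point(89.0)` reproduces the $0.89 \times 4 + 1$ conversion.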
Note that parameter $c$ indicates how fast the quality drops with increasing $q$, with a larger $c$ suggesting a faster drop. On the other hand, parameter $d$ reveals how fast the quality reduces as the frame rate decreases, with a smaller $d$ corresponding to a faster drop. Our prior work [16], [19] has shown that parameter $d$ is closely related to some motion attributes of the video. Derivation of the model parameters from the original or coded video is a subject of our future study.

[Fig. 6. Video quality model (13) in terms of quantization stepsize and frame rate. The discrete points are the measured MOS data for different quantization stepsizes.]

[TABLE II. Parameters for the quality model and model accuracy. Rows: $c$, $d$, RMSE/$Q_{\max}$, PC. Surviving RMSE/$Q_{\max}$ entries: .55%, .67%, .25%, .54%.]

We note that the quality model parameters depend very much on the underlying viewers. In our current study, the model is derived from MOS obtained from a relatively large group of viewers, and hence is meant to characterize an average viewer. Such models are useful when one designs a video system to optimize the perceptual quality for all potential viewers. For any particular viewer, parameters $c$ and $d$ are likely to depend on the viewer's sensitivities to quantization artifacts and motion jerkiness, respectively. In order to optimize for an individual user's perceptual quality, the model parameters should be determined based on both the video content and some viewer attributes. This is discussed further in Sec. IV.

Combining the rate and quality models, we draw in Fig. 7 the quality vs. rate curves achievable at different frame rates. We also plot the measured MOS data on the same figure. The model fits the measured data very well for the sequences Akiyo and Crew. But the model is not as accurate at some frame rates for Football and City, due to slight errors in both rate and quality prediction. It is clear from this figure that

each frame rate is optimal only for a certain rate region. By connecting the top segments for each rate region in the figure for each sequence, we effectively obtain the operational rate-quality function of the SVC encoder for that sequence.

[Fig. 7. Quality vs. rate at different frame rates. Points are measured data; curves are based on the rate model in Eq. (8) and the quality model in Eq. (13).]

IV. RATE-CONSTRAINED BIT STREAM ADAPTATION

In this section, we consider how to apply our proposed rate and quality models to perform rate-constrained SVC bit stream adaptation. Figure 8 provides a system view of the adaptation problem. For each video, a single full-resolution scalable stream is available at a media content server, which will be adapted at a network proxy or gateway in response to the user's channel conditions and viewing preferences. When a user requests the video from the server, the adaptor (sitting at the proxy) will determine an appropriate video rate $R_0$ based on the user's channel condition (e.g., $R_0$ can be the sustainable transmission rate for the given channel condition minus all the overheads for channel error correction and packetization). Based on $R_0$ and the user's viewing preference setting (embedded in the user profile sent to the adaptor), the adaptor determines the optimal set of temporal and amplitude layers (and, more generally, spatial layers) to extract, so as to provide the best perceptual quality. In Fig. 8, we assume that the adaptor monitors the channel condition based on some feedback from the user. (The user may inform the adaptor of its desired rate $R_0$ in alternative implementations.) Furthermore, it determines the quality model parameters based on the user's preference setting, which describes the user's preferred tradeoff among spatial, temporal, and amplitude resolutions. Recall that parameters $c$ and $d$ in the quality model depend on the viewer's sensitivities to temporal and amplitude resolutions.
Note that the parameters of the rate model for each video can be predetermined as discussed in Sec. II, and embedded in the full-resolution bitstream. The parameters for the quality model need to be determined based on both the video content and the viewer preference setting, as discussed in Sec. III. In a simpler implementation, the adaptor may ignore the user's preference setting and use quality model parameters tuned for average viewers. Based on the target rate $R_0$ and the model parameters, the adaptor determines the optimal frame rate $t_{opt}$ and quantization stepsize $q_{opt}$, and the corresponding temporal and amplitude layers. Finally, the adaptor extracts these layers from the scalable bit stream and delivers the resulting bit stream to the user.

For a given target rate $R_0$, the adaptation problem can be formulated as the following constrained optimization problem:

$$\text{Determine } t, q \text{ to maximize } Q(q, t) \text{ subject to } R(q, t) \le R_0. \qquad (14)$$

In the following subsections, we employ the proposed rate and quality models to solve this optimization problem, first assuming the frame rate can be any positive value, and then considering the discrete set of frame rates afforded by the dyadic temporal prediction structure.

[Fig. 8. Rate-constrained bit stream adaptation. Block diagram: the video server supplies the scalable video stream to the rate-constrained SVC video adaptation module; given the target rate $R_0$ and model parameters, the module identifies the feasible points $(q, t)$ with $R(q, t) \le R_0$, evaluates the supportable quality $Q(q, t)$, selects the optimal setting $(q_{opt}, t_{opt})$, and sends the video substream to the user, who returns the channel condition and user profile.]

A. Optimal Solution Assuming Continuous $t$ and $q$

We first solve the constrained optimization problem in (14) assuming both the frame rate $t$ and quantization stepsize $q$ can take on any value in the range $t \in (0, t_{\max})$, $q \in (q_{\min}, +\infty)$. To simplify the notation, let $\hat{Q} = Q_{\max} / \left((1 - e^{-d})\, e^{-c}\right)$, $\hat{t} = t / t_{\max}$, $\hat{q} = q / q_{\min}$, $\hat{R} = R / R_{\max}$, and $\hat{R}_0 = R_0 / R_{\max}$. The rate and quality models in (8) and (13) then become, respectively,

$$\hat{R}(\hat{q}, \hat{t}) = \hat{q}^{-a}\, \hat{t}^{\,b}, \qquad (15)$$

$$Q(\hat{q}, \hat{t}) = \hat{Q}\, e^{-c \hat{q}} \left(1 - e^{-d \hat{t}}\right). \qquad (16)$$

By setting $\hat{R}(\hat{q}, \hat{t}) = \hat{R}_0$ in (15), we obtain

$$\hat{q} = \sqrt[a]{\hat{t}^{\,b} / \hat{R}_0}, \qquad (17)$$

which describes the feasible $q$ for a given $t$ that satisfies the rate constraint $R_0$. Substituting (17) into (16) yields

$$Q(\hat{t}) = \hat{Q}\, e^{-c\, \hat{t}^{\,\psi} / \sqrt[a]{\hat{R}_0}} \left(1 - e^{-d \hat{t}}\right), \quad \hat{t} \in (0, 1). \qquad (18)$$

where $\psi = b/a$. Equation (18) expresses the achievable quality at different frame rates under the rate constraint $R_0$. Clearly, this function has a unique maximum, which can be found by setting its derivative with respect to $\hat{t}$ to zero. This yields

$$ \left(\frac{1}{\hat{R}_0}\right)^{1/a} \frac{c\,\psi\,\hat{t}^{\,\psi - 1}\left(1 - e^{-d\hat{t}}\right)}{d\,e^{-d\hat{t}}} = 1. \tag{19} $$

For any given rate constraint $R_0$, we can solve (19) numerically to determine the optimal frame rate $t_{\mathrm{opt}}$. Then, using (17) and (18), we can determine the optimal quantization stepsize $q_{\mathrm{opt}}$ and the corresponding maximum quality $Q_{\mathrm{opt}}$ at the rate $R_0$. Figure 9 shows $t_{\mathrm{opt}}$, $q_{\mathrm{opt}}$, and $Q_{\mathrm{opt}}$ as functions of the rate constraint $R_0$. As expected, as the rate increases, $t_{\mathrm{opt}}$ increases while $q_{\mathrm{opt}}$ decreases, and the best achievable quality improves continuously. Notice that $t_{\mathrm{opt}}$ increases more rapidly for one of the sequences than for the other three, because of its faster motion. Based on the parameters derived from our subjective test data, even at the highest bit rates examined, the optimal frame rate is below 20 Hz for the other three sequences. Note that had we used a smaller $q_{\min}$ to allow much higher values of $R_{\max}$, $t_{\mathrm{opt}}$ would have increased to 30 Hz beyond some rates.

Fig. 9. Optimal quantization stepsize, frame rate, and the corresponding quality index versus the bit rate $R_0$, assuming the quantization stepsize and frame rate can take on any value within their effective range.

Fig. 10. Optimal quantization stepsize, frame rate, and the corresponding quality index versus the bit rate $R_0$, assuming $q$ varies continuously and the frame rate takes the values 1.875/3.75/7.5/15/30 Hz afforded by the dyadic hierarchical prediction structure.

B. Optimal solution under dyadic temporal scalability structure

A popular way to implement temporal scalability is through the dyadic hierarchical B-picture prediction structure, in which the frame rate doubles with each additional temporal layer. With 5 temporal layers, the corresponding frame rates are 1.875, 3.75, 7.5, 15, and 30 Hz. From a practical point of view, it is interesting to see what the optimal combination of frame rate and quantization stepsize is for different bit rates under this structure. To obtain the optimal solution under this scenario, for each given rate, we determine the quality values corresponding to all five possible frame rates using (18), and choose the frame rate (and its corresponding quantization stepsize from (17)) that leads to the highest quality. The results are shown in Fig. 10. Because the frame rate $t$ can only increase in discrete steps, the optimal $q$ does not decrease monotonically with the rate. Rather, whenever $t_{\mathrm{opt}}$ jumps to the next higher value (doubles), $q_{\mathrm{opt}}$ first increases to meet the rate constraint, and then decreases as the rate increases while $t$ is held constant. Consistent with the previous results in Fig. 9, for the fast-motion sequence the optimal frame rate transitions to 30 Hz at an intermediate bit rate, whereas for the other sequences the optimal frame rate stays at 15 Hz even at the highest bit rates examined. As mentioned earlier, had we used a lower $q_{\min}$, we would have seen transitions to 30 Hz after some rates. The results in Fig. 10 can be validated by cross-checking with Fig. 7. For example, for Crew, in the rate region below 25 kbps, 1.875 Hz leads to the highest quality; in the rate range between 25 and 6 kbps, 3.75 Hz gives the highest quality; between 6 and 253 kbps, 7.5 Hz is the best; and beyond 253 kbps, 15 Hz provides the highest quality. Connecting the top segments for each sequence in Fig. 7 leads to the curve in Fig. 10.
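The two procedures described above (the continuous search over the frame rate and the selection among the five dyadic frame rates) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the parameter values for a, b, c, and d are placeholders rather than values reported in the paper, and the continuous case is handled by a dense grid search over the normalized frame rate instead of solving the derivative condition. The sketch also assumes that a frame rate is skipped whenever meeting the rate target would require a stepsize below q_min.

```python
import math


def stepsize_for_rate(t_hat, r0_hat, a, b):
    """Normalized stepsize q/q_min that meets the rate target with equality."""
    return (t_hat ** b / r0_hat) ** (1.0 / a)


def quality(q_hat, t_hat, c, d, q_max=1.0):
    """Normalized quality model, scaled so that quality(1, 1) equals q_max."""
    scale = q_max / ((1.0 - math.exp(-d)) * math.exp(-c))
    return scale * math.exp(-c * q_hat) * (1.0 - math.exp(-d * t_hat))


def best_setting(t_hats, r0_hat, a, b, c, d):
    """Return (quality, t_hat, q_hat) for the best candidate frame rate, or None."""
    best = None
    for t_hat in t_hats:
        q_hat = stepsize_for_rate(t_hat, r0_hat, a, b)
        if q_hat < 1.0:  # assumption: stepsizes below q_min are not available
            continue
        candidate = (quality(q_hat, t_hat, c, d), t_hat, q_hat)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


def continuous_frame_rates(steps=20000):
    """Dense grid over t/t_max in (0, 1], standing in for a continuous search."""
    return [i / steps for i in range(1, steps + 1)]


def dyadic_frame_rates(layers=5):
    """t/t_max for a dyadic structure: 1, 1/2, 1/4, ... (30 Hz down to 1.875 Hz)."""
    return [1.0 / 2 ** k for k in range(layers)]


# Placeholder parameters for illustration only (not taken from the paper).
A, B, C, D = 1.2, 0.6, 0.15, 7.0
T_MAX = 30.0  # Hz

if __name__ == "__main__":
    for r0_hat in (0.05, 0.2, 0.5, 1.0):
        qc, tc, sc = best_setting(continuous_frame_rates(), r0_hat, A, B, C, D)
        qd, td, sd = best_setting(dyadic_frame_rates(), r0_hat, A, B, C, D)
        print(f"R0/Rmax={r0_hat:.2f} "
              f"continuous: t={tc * T_MAX:.2f} Hz, q/qmin={sc:.2f}, Q={qc:.3f} | "
              f"dyadic: t={td * T_MAX:.3f} Hz, q/qmin={sd:.2f}, Q={qd:.3f}")
```

Because the dyadic frame rates are a subset of the grid used for the continuous case, the dyadic quality can never exceed the continuous one in this sketch, which mirrors the relationship between Fig. 9 and Fig. 10.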
In practice, the SVC encoder with CGS quality scalability does not allow the quantization stepsize to change continuously. The finest granularity in quality scalability is a decrement of QP by 1 with each additional quality layer, which means that the quantization stepsize is reduced by a factor of $2^{1/6}$ with each additional layer. In practice, much coarser granularity is typically used, with a QP decrement of 2 to 4 per layer. When we constrain $q$ to take only the discrete values corresponding to such QP values, in addition to allowing only dyadic frame rates, one cannot always meet a rate constraint exactly. One can still solve for the optimal $t$ and $q$ for any given rate constraint using the proposed models, by exhaustive search within the finite set of feasible values of $t$ and $q$.

V. CONCLUDING REMARKS

In this paper we examine the impact of frame rate $t$ and quantization stepsize $q$ on the rate and perceptual quality of scalable video. Both models are developed based on the key observation from experimental data that the relative reduction of either rate or quality when the frame rate decreases is largely independent of the quantization stepsize. This observation enables us to express both rate and quality as the product of a function of $q$ and a function of $t$. The proposed rate and quality models are analytically tractable, each requiring only two content-dependent parameters. The rate model fits the measured rates very accurately, with an average Pearson correlation of 0.998 over four video sequences. The quality model also matches the MOS from subjective tests very well, with an average Pearson correlation of 0.984. We further apply these models to rate-constrained SVC bitstream adaptation, where the problem is to determine the frame rate and quantization stepsize that lead to the highest perceptual quality for a given target rate. We derive the optimal frame rate $t_{\mathrm{opt}}$ and quantization stepsize $q_{\mathrm{opt}}$, both as functions of the rate $R_0$, first by assuming $t$ can vary continuously to provide theoretical insights, and then by considering the feasible set of discrete frame rates afforded by the hierarchical temporal prediction structure. The proposed rate and quality models have other applications beyond SVC bit stream adaptation. One important application is in non-scalable encoder optimization, e.g., determining the optimal encoding frame rate for a target bit rate.
They can also be used for scalable encoder optimization, e.g., determining the appropriate temporal and amplitude layers to generate and include at different rate ranges. For the proposed models to be adopted in practical applications, one must be able to determine the model parameters easily, either from the original or from the coded sequences. For the rate model, the parameters for each sequence can be easily derived from the actual bit rates of the layers corresponding to selected combinations of $q$ and $t$, once a complete scalable bit stream is created. However, to use the rate model for encoder optimization, it is desirable to determine the model parameters from content features (such as motion and contrast) derived from the original video. Similarly, for the quality model, we will investigate the correlation between the model parameters and content features. Both will be subjects of our future study. We will further investigate how to take into account the viewer's sensitivities to jerkiness and coding artifacts when determining the quality model parameters for individual viewers.

ACKNOWLEDGMENT

This material is based upon work supported in part by the National Science Foundation under Grant No.
