Mobile Video Quality Assessment Database

Anush Krishna Moorthy, Lark Kwon Choi, Alan Conrad Bovik and Gustavo de Veciana
Department of Electrical & Computer Engineering, The University of Texas at Austin, Austin, TX 78712-1084, USA.

Abstract

We introduce a new research tool: the LIVE Mobile Video Quality Assessment (VQA) database. The database consists of pristine reference and distorted videos, along with human (subjective) opinion scores of the associated video quality. The database was designed to improve our understanding of human judgements of time-varying video quality in heavily-trafficked wireless networks. A byproduct of such an understanding could be quantitative models useful for the development of perceptually-aware algorithms for resource allocation and rate adaptation for video streaming. The database consists of 200 distorted videos created from 10 RAW HD videos acquired using a RED ONE digital cinematographic camera. It includes static distortions such as compression and wireless packet loss, as well as dynamically varying distortions. We describe the creation of the database, the simulated distortions, and the human study that we conducted to obtain 5,300 time-sampled subjective traces of quality and summary subjective scores. We analyze the results obtained for certain subclasses of distortions of interest in the context of wireless video delivery. The LIVE Mobile VQA database, including the human subjective scores, will be made available to researchers in the field at no cost in order to aid the development of novel strategies for video-aware resource allocation.

I. INTRODUCTION

According to the Cisco Visual Networking Index (VNI) forecast [1], global mobile traffic nearly tripled in 2010 for the third consecutive year, and nearly 50% of this traffic was accounted for by mobile video, with this number predicted to increase to more than 75% by 2015.
Associated with this explosion in mobile video streaming is a paucity of bandwidth that is already evident [2] and is only predicted to get worse [3]. In this environment, developing frameworks for efficient resource allocation when transmitting video is a topic of pressing practical interest and a field of intense study. A promising direction of research is the perceptual optimization of wireless video networks, wherein network resource allocation protocols are designed to provide video experiences that are measurably improved under perceptual models. Since the final receivers of videos transported over wireless networks are humans, it is imperative that perceptual models for resource optimization capture human opinion on visual quality. Here, we summarize some results from a large-scale human (subjective) study that we recently conducted to gauge subjective opinion on HD videos displayed on mobile devices. There exist several subjective studies for video quality assessment (VQA) [4]–[7]; however, these studies were performed on large-screen displays, and the distortions included compression and/or transmission over networks. Studies on mobile displays include that in [8], which evaluated the quality of the H.264 scalable video codec (SVC); that in [9], which evaluated image resolution requirements for MobileTV; as well as those in [10]–[13]. While each of the above databases and studies is valuable, almost all of them suffer from one or more of the following problems: (1) small, insignificant database size; (2) insufficient distortion separation for judgements on perceptual quality; (3) unknown (not necessarily uncompressed) sources, with unknown source distortions; (4) low video resolutions that are not relevant in today's world; and (5) lack of public availability of the database.
In order to provide the research community with a modern and adequate resource enabling suitable modeling of human subjective opinion on video quality, we have created a large database of broad utility. The LIVE Mobile VQA database consists of 200 distorted videos at HD (720p) resolution, and provides human opinion on subjective quality obtained by analyzing responses from over 50 subjects, resulting in 5,300 summary subjective scores and time-sampled subjective traces of quality. The study was conducted on a small mobile screen (4") as well as a larger tablet screen (10.1") and encompasses a wide variety of distortions, including compression and wireless packet loss. More importantly, the database is the first of its kind to include dynamic distortions, i.e., distortions that vary as a function of time as the subject views a video, in order to simulate scenarios with variable bit-rate delivery. In this article, we summarize certain key aspects of the database construction and the human study conducted, and describe significant results relevant to researchers in the wireless video space.

II. SUBJECTIVE ASSESSMENT OF MOBILE VIDEO QUALITY

A. Source Videos and Distortion Simulation

Source videos were obtained using a RED ONE digital cinematographic camera; the 12-bit REDCODE RAW data was captured at a resolution of 2K (2048×1152) at frame rates of 30 fps and 60 fps using the REDCODE 42 MB/s option to ensure the best possible acquisition quality. Source videos were first truncated to 15 seconds, then downsampled to a resolution of 1280×720 (720p) at 30 fps and converted into uncompressed .yuv files. A total of 12 videos from a larger set were used for the study, two of which were used for training the subjects, while the rest were used in the actual study. Figure 1 shows sample frames from some of the video sequences.
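As a rough check on the storage these sources require (our own arithmetic, not a figure stated in the paper), each uncompressed clip at this resolution is sizable, assuming the common planar YUV 4:2:0 layout at 8 bits per sample:

```python
# Back-of-the-envelope size of one uncompressed source clip.
# Assumes planar YUV 4:2:0 with 8-bit samples (the .yuv extension
# suggests this, but the exact pixel format is not stated in the text).
width, height = 1280, 720        # 720p
fps, seconds = 30, 15            # frame rate and clip duration
bytes_per_frame = width * height * 3 // 2   # luma plane + two quarter-size chroma planes
total_bytes = bytes_per_frame * fps * seconds
print(round(total_bytes / 1e6))  # -> 622 (MB per 15-second clip)
```

At roughly 622 MB per 15-second clip, distributing the raw sources alongside the 200 distorted videos is a nontrivial undertaking, which is one reason public availability of such databases is rare.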
Fig. 1. Example frames of the videos used in the study.

Fig. 2. Rate Adaptation: Schematic diagram of the three different rate-switches in a video stream simulated in this study.

Each of the reference videos was subjected to a variety of distortions, including: (a) compression, (b) wireless channel packet-loss, (c) frame-freezes, (d) rate adaptation and (e) temporal dynamics. Each source video was compressed using the JM reference implementation of the H.264 scalable video codec (SVC) [] at four rates R1, R2, R3, R4, where R1 < R2 < R3 < R4, between 0.7 Mbps and 6 Mbps using fixed-QP encoding, yielding 40 distorted videos. The QP values (and hence the bit-rates) were selected manually for each video to ensure perceptual separation, so that humans (and algorithms alike) are capable of producing judgements of visual quality [4], [14]. Wireless packet loss was simulated using a Rayleigh fading channel over which each compressed and packetized video was transmitted, which led to a total of 40 distorted videos. Four frame-freeze conditions were simulated for each source video, yielding a total of 40 distorted videos. To investigate whether humans are more sensitive to changes in distortion levels than to the absolute level of the distortion (similar to the behavior seen in psychovisual studies [15]), we also simulated rate-changes as a function of time as the subject views a particular video. Specifically, the subject starts viewing the video at a rate RX, then after n seconds switches to a higher rate RY, then again after n seconds switches back to the original rate RX, as illustrated in Fig. 2. We simulated three different rate switches, where RX = R1, R2 and R3 and RY = R4. Although the duration n is another potential variable affecting human quality of experience, because of the length of the test sessions we fixed n = 5 sec., which along with the three rate switches yielded a total of 30 rate-adapted distorted videos.
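For concreteness, the single rate-switch condition can be written down as a per-second rate schedule. The sketch below is illustrative only: the function name is ours, and the exact rate values are assumptions (the study fixed n = 5 s on 15-second clips, with rates between 0.7 and 6 Mbps chosen per content):

```python
# Sketch (not the authors' code) of the single rate-switch condition:
# RX for n seconds, RY for n seconds, then back to RX.
def single_switch_schedule(rx, ry, n=5, duration=15):
    """Per-second rate trace for the RX -> RY -> RX condition."""
    return [ry if n <= t < 2 * n else rx for t in range(duration)]

# Illustrative rates in Mbps (R1 < R2 < R3 < R4; these exact values are made up):
R1, R2, R3, R4 = 0.7, 1.5, 3.0, 6.0

# The three simulated switches use RX = R1, R2, R3 with RY = R4:
traces = {rx: single_switch_schedule(rx, R4) for rx in (R1, R2, R3)}
```

Each trace spends 10 of its 15 seconds at the lower rate RX and 5 seconds at R4, which is the timing pattern the equal-average-rate comparisons later in the paper rely on.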
Fig. 3. Temporal Dynamics: Schematic illustration of two rate changes across the video; the average rate remains the same in both cases. Left: multiple changes; Right: single rate change. Note that we have already simulated the single rate-change condition as illustrated in Fig. 2, hence we ensure that the average bit-rate is the same for these two cases.

Fig. 4. Temporal Dynamics: Schematic illustration of rate-change scenarios. The average rate remains the same in all cases and is the same as in Fig. 3. The first row steps to rate R2 and then steps to a higher/lower rate, while the second row steps to R3 and then back up/down again.

A temporal rate (and thus quality) dynamics condition was simulated to evaluate the effect of multiple rate-switches (as against the single switch in the previous condition). The rate was varied between R1 and R4 multiple times (three), as illustrated in Fig. 3. We also simulated a set of distorted videos that evaluated the effect of the abruptness of the switch, i.e., instead of switching directly between R1 and R4, the rate was first switched to an intermediate level RZ from the current level and then to the other extreme. We simulated the following rate-switches: (1) R1→R2→R4, (2) R1→R3→R4, (3) R4→R2→R1 and (4) R4→R3→R1, as illustrated in Fig. 4. The average bit-rate across the temporal-dynamics and rate-adaptation conditions remained the same, to enable an objective comparison across these conditions. This yielded a total of 50 distorted videos. In summary, the LIVE Mobile VQA database consists of 10
reference videos and 200 distorted videos (4 compression + 4 wireless packet-loss + 4 frame-freeze + 3 rate-adapted + 5 temporal-dynamics videos per reference), each of resolution 1280×720 at a frame rate of 30 fps, and of duration 15 seconds.

B. Test Methodology

A single-stimulus continuous quality evaluation (SSCQE) study [16] with hidden reference [4], [14], [17] was conducted over a period of three weeks at the LIVE subjective testing lab at The University of Texas at Austin, where the subjects viewed the videos on the Motorola Atrix, which has a 4" screen with a resolution of 960×540, using software specially created for the Android platform to display the videos. The study involved mostly naive undergraduate students whose ages ranged between 22 and 28 years. Following our philosophy of using a reasonably representative sampling of the visual population, no vision test was performed, although a verbal confirmation of soundness of (corrected) vision was obtained from each subject. Each subject attended two separate sessions as part of the study, each lasting less than 30 minutes and each consisting of the subject viewing 55 videos (50 distorted + 5 reference) in randomized order; a short training set (6 videos) preceded the actual study. The videos were displayed at the center of the screen with an un-calibrated continuous bar at the bottom, which was controlled using the touchscreen. The subjects were asked to rate the videos as a function of time, i.e., to provide instantaneous ratings of the videos, as well as an overall rating at the end of each video. At the end of each video a similar continuous bar was displayed on the screen, although it was calibrated as Bad, Fair and Excellent by equally spaced markings across the bar. Once the quality was entered, the subject was not allowed to change the score. The quality ratings were in the range 0-5.
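A hidden-reference design like the one above is typically scored by offsetting each subject's rating of a distorted video by that subject's rating of the corresponding (hidden) reference. The sketch below is our illustration of that idea, not the study's exact formula (which is detailed in the companion paper); the sign convention (larger difference = worse quality) is an assumption:

```python
from statistics import mean, stdev

def difference_score(scores_distorted, scores_reference):
    """Hidden-reference difference score sketch: each subject's rating of
    the distorted video is subtracted from his/her rating of the hidden
    reference of the same content, and the differences are averaged
    across subjects. Ratings are on the 0-5 scale described above.
    This is an illustration, not the study's exact DMOS formula."""
    diffs = [r - d for d, r in zip(scores_distorted, scores_reference)]
    return mean(diffs), stdev(diffs)

# Hypothetical ratings from five subjects for one distorted video and
# its hidden reference:
d, spread = difference_score([2.1, 2.5, 1.8, 2.2, 2.4],
                             [4.5, 4.8, 4.2, 4.6, 4.9])
# d is the DMOS-style summary score; spread is its variability across subjects
```

The per-subject spread is what feeds the Gaussian model and the standard errors used in the statistical comparisons of the next section.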
A total of thirty-six subjects participated in the mobile study, and the design was such that each video received 18 subjective ratings. The instructions provided to the subjects are reproduced in the Appendix. The subject rejection procedure in [16] was used to reject two subjects from the mobile study, and the remaining scores were averaged to form a Differential Mean Opinion Score (DMOS) for each video [4], which is representative of the perceived quality of the video. DMOS was computed only for the overall scores that the subjects assigned to the videos. The average standard error in the DMOS was 0.2577 across the 200 distorted videos. For all further analysis, we model the scores for each video as a Gaussian distribution centered at the DMOS, with a standard deviation computed from the differential opinion scores across subjects.

C. Evaluation of Subjective Opinion

For each of the temporal distortion classes, we conducted a t-test, at the 95% confidence level, between the Gaussian distributions centered at the DMOS values (with their associated, known standard deviations) of the conditions we wished to compare. Since the conditions being compared are functions of content, we compared each of the 10 reference contents separately for each pair of conditions. In the tables that follow, a value of 1 indicates that the row-condition is statistically superior to the column-condition, while a 0 indicates that the row-condition is statistically worse than the column-condition. The results from the statistical analysis are tabulated in Tables I-V. Owing to the dense nature of the content, we summarize the results in the following paragraphs. Note that the text only provides a high-level description of the results; the reader is advised to study the tables thoroughly in order to better understand the results. a) Compression (Table I): This table confirms that the distorted videos were perceptually separable.
Notice that each compression rate is statistically better (perceptually) than the next lower rate over all of the content used in the study. b) Rate Adaptation (Tables II, III): Our results indicate that it is preferable to switch from a low rate to a higher one and back, provided the duration at the higher rate is at least half as long as the duration at the lower rate. This is contrary to the common wisdom that people prefer not to see fluctuations in video quality, given the alternative of staying at the lower rate. Further, if the rates are perceptually separated (as our rates are), a change in the lowest rate has a definite impact on the visual quality. c) Temporal Dynamics (Tables IV, V): An interesting observation from the results is that users prefer multiple rate switches over fewer switches. Again, while this may be contrary to conventional wisdom, there seems to be a plausible explanation for such behavior. When shown a high-quality segment of a video for a long duration, the subject acclimatizes to the viewing quality, raising the bar for acceptance, so that when the high-quality segments are followed by long low-quality segments, he/she assigns a higher penalty than on videos which contain short segments of higher quality. Thus, a long low-quality segment preceded by a long high-quality one evokes a negative response, whereas it may be conjectured that videos with multiple short switches are seen as attempts to improve the viewing experience, thereby boosting the overall perception of quality. We note that our results are conditioned on the degree of separation between the quality levels as well as the duration of each segment, and may not generalize to other switches between quality levels at lower separation or with faster/slower segment durations. The tables also indicate that switching to an intermediate rate before switching to a higher/lower rate is preferable: easing a user into the new quality level is seemingly always better than simply jumping to the final quality level.
It is also almost always true that the intermediate level should be closer to the highest quality level in the switch. Finally, the results also indicate that the quality of the end-segment has a definite impact on the overall perception, and ending on a higher quality segment is almost always preferable.
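The pairwise comparisons behind Tables I-V can be sketched numerically. The function below is our illustration: the paper reports a t-test, and the one-sided z-style formulation here (together with the convention that a lower DMOS means better quality, and the hypothetical scores) is a simplification of ours:

```python
import math

def row_superior(dmos_row, sd_row, dmos_col, sd_col, alpha=0.05):
    """Sketch of one table entry: True (a '1') if the row condition is
    statistically better than the column condition, modeling each
    condition's score as a Gaussian with the given standard deviation,
    and assuming lower DMOS = better quality."""
    z = (dmos_col - dmos_row) / math.sqrt(sd_row**2 + sd_col**2)
    p = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))  # one-sided p-value
    return p < alpha

# Hypothetical DMOS values for two compression rates of one content:
print(row_superior(1.2, 0.26, 2.0, 0.26))  # -> True (row clearly better)
```

The 0.26 standard deviations echo the average standard error of 0.2577 reported above; with separations much smaller than that, the test would return a tie rather than a 1 or 0.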
In Tables I-V, each cell contains one sub-entry per reference video; an entry of "-" indicates that the two conditions were statistically indistinguishable for that content.

TABLE I
RESULTS OF T-TEST BETWEEN THE VARIOUS COMPRESSION-RATES SIMULATED IN THE STUDY. EACH SUB-ENTRY IN EACH ROW/COLUMN CORRESPONDS TO THE 10 REFERENCE VIDEOS IN THE STUDY.

     | R1                  | R2                  | R3                  | R4
R1   | - - - - - - - - - - | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0
R2   | 1 1 1 1 1 1 1 1 1 1 | - - - - - - - - - - | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0
R3   | 1 1 1 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 1 1 1 | - - - - - - - - - - | 0 0 0 0 0 0 0 0 0 0
R4   | 1 1 1 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 1 1 1 | - - - - - - - - - -

TABLE II
RESULTS OF T-TEST BETWEEN THE VARIOUS RATE-ADAPTED DISTORTED VIDEOS SIMULATED IN THE STUDY. EACH SUB-ENTRY IN EACH ROW/COLUMN CORRESPONDS TO THE 10 REFERENCE VIDEOS IN THE STUDY.

          | R1→R4→R1            | R2→R4→R2            | R3→R4→R3
R1→R4→R1  | - - - - - - - - - - | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0
R2→R4→R2  | 1 1 1 1 1 1 1 1 1 1 | - - - - - - - - - - | 0 0 0 0 0 0 0 0 0 0
R3→R4→R3  | 1 1 1 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 1 1 1 | - - - - - - - - - -

TABLE III
RESULTS OF T-TEST BETWEEN THE VARIOUS COMPRESSION-RATES AND THE RATE-ADAPTED VIDEOS SIMULATED IN THE STUDY. EACH SUB-ENTRY IN EACH ROW/COLUMN CORRESPONDS TO THE 10 REFERENCE VIDEOS IN THE STUDY.

          | R1                  | R2                  | R3                  | R4
R1→R4→R1  | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 - 0 1 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0
R2→R4→R2  | 1 1 1 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0
R3→R4→R3  | 1 1 1 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 1 1 1 | 1 1 1 - 0 1 - 0 1 0 | 0 0 0 0 0 0 0 0 0 0

TABLE IV
RESULTS OF T-TEST BETWEEN MULTIPLE RATE SWITCHES AND A SINGLE RATE SWITCH. EACH SUB-ENTRY IN EACH ROW/COLUMN CORRESPONDS TO THE 10 REFERENCE VIDEOS IN THE STUDY.

                | R1→R4→R1            | R1→R4→R1→R4→R1
R1→R4→R1        | - - - - - - - - - - | 0 - - - 0 0 0 0 1 -
R1→R4→R1→R4→R1  | 1 - - - 1 1 1 1 0 - | - - - - - - - - - -

III.
DISCUSSION AND CONCLUSION

We described a new resource, the LIVE Mobile VQA database, consisting of HD videos that incorporate a wide variety of distortions, along with the associated subjective opinion scores on visual quality. The simulated distortions include the previously studied uniform compression and wireless packet loss, as well as novel dynamically-varying distortions. The large size of the study and the variety that it offers allow one to study and analyze human reactions to temporally varying distortions, as well as to varying form factors, from a wide variety of perspectives. It is clear from the foregoing analysis that time-varying quality has a definite and quantifiable impact on human subjective opinion, and that this opinion is a function of the duration of the changes, the quality levels, and the order in which the variations in quality occur. Our results sometimes contradict popularly held beliefs about video quality; these contradictions suggest provocative new conclusions. Humans are seemingly not as unforgiving as one may believe, and appear to reward attempts to improve quality. Rapid changes in quality levels as a function of time are perhaps perceived by humans as attempts to provide better quality, and hence these kinds of temporal distortions yield higher quality scores compared with conditions in which long segments of low quality follow long segments of high quality. Owing to limitations on the study session durations, we were unable to include several other interesting conditions, including a greater number of rate-changes, multiple rate changes between different quality levels, a single change with a high-quality segment at the end (e.g., R4→R1→R4), and so on. Future work will address these relevant scenarios to better understand human perception of visual quality. In this article, we only summarized a relevant portion of the database.
We detail the entire database and the analysis of all of the distortions, including an analysis of the temporal traces of subjective quality and of objective algorithm performance, in [18]. The cited article also contains a description and analysis of the same study conducted on a tablet screen, and comparisons between subjective opinions as a function of the display device. We hope that the new LIVE Mobile VQA database of 200 distorted videos and associated human opinion scores from over 50 subjects will provide fertile ground for years of future research. Given the sheer quantity of data, we believe that our foregoing analysis (and that detailed in [18]) is the tip of the iceberg of discovery. We invite further analysis of the data towards understanding and producing better models of human
behavior when viewing videos on mobile platforms.

TABLE V
RESULTS OF T-TEST BETWEEN THE VARIOUS TEMPORAL-DYNAMICS DISTORTED VIDEOS SIMULATED IN THE STUDY. EACH SUB-ENTRY IN EACH ROW/COLUMN CORRESPONDS TO THE 10 REFERENCE VIDEOS IN THE STUDY.

                | R1→R4→R1→R4→R1      | R1→R2→R4            | R4→R2→R1            | R1→R3→R4            | R4→R3→R1
R1→R4→R1→R4→R1  | - - - - - - - - - - | - 0 0 0 1 1 0 0 0 0 | 1 1 - 1 1 1 1 1 1 1 | 0 0 0 0 - 0 0 0 0 0 | 1 1 0 0 1 1 1 1 1 1
R1→R2→R4        | - 1 1 1 0 0 1 1 1 1 | - - - - - - - - - - | 1 1 1 1 1 1 1 1 1 1 | 0 0 - 0 0 0 - 0 0 0 | 1 1 1 - 1 1 1 1 1 1
R4→R2→R1        | 0 0 - 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 | - - - - - - - - - - | 0 0 0 0 0 0 0 0 0 0 | 1 - 0 0 0 0 - 0 1 0
R1→R3→R4        | 1 1 1 1 - 1 1 1 1 1 | 1 1 - 1 1 1 - 1 1 1 | 1 1 1 1 1 1 1 1 1 1 | - - - - - - - - - - | 1 1 1 1 1 1 1 1 1 1
R4→R3→R1        | 0 0 1 1 0 0 0 0 0 0 | 0 0 0 - 0 0 0 0 0 0 | 0 - 1 1 1 1 - 1 0 1 | 0 0 0 0 0 0 0 0 0 0 | - - - - - - - - - -

APPENDIX
INSTRUCTIONS TO THE SUBJECT

You are taking part in a study to assess the quality of videos. You will be shown a video at the center of your screen, and there will be a rating bar at the bottom, which can be controlled by using your fingers on the touchscreen. You are to rate the quality as a function of time, i.e., move the rating bar in real-time based on your instantaneous perception of quality. The extreme left of the bar is bad quality and the extreme right is excellent quality. At the end of the video you will be presented with a similar bar, this time calibrated as Bad, Poor and Excellent, from left to right. Using this bar, provide us with your opinion on the overall quality of the video. There is no right or wrong answer; we simply wish to gauge your opinion on the quality of the video that is shown to you.

ACKNOWLEDGMENT

This research was supported by the National Science Foundation under grant CCF-0728748 and by Intel and Cisco Corporation under the VAWN program.

REFERENCES

[1] Cisco Corp., "Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2010-2015," http://www.cisco.com/en/us/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html, 2011.
[2] PCWorld, "FCC warns of impending wireless spectrum shortage," http://www.pcworld.com/article/186434/fcc_warns_of_impending_wireless_spectrum_shortage.html, 2010.
[3] S. Higginbotham, "Spectrum shortage will strike in 2013," http://gigaom.com/2010/02/17/analyst-spectrum-shortage-will-strike-in-2013/, 2010.
[4] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, "Study of subjective and objective quality assessment of video," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 1427-1441, 2010.
[5] Video Quality Experts Group (VQEG), "Final report from the video quality experts group on the validation of objective quality metrics for video quality assessment phase II," http://www.its.bldrdoc.gov/vqeg/projects/frtv_phaseii, 2003.
[6] Video Quality Experts Group (VQEG), "Final report from the video quality experts group on the validation of objective quality metrics for video quality assessment phase I," http://www.its.bldrdoc.gov/vqeg/projects/frtv_phasei, 2000.
[7] Video Quality Experts Group (VQEG), "Final report of video quality experts group multimedia phase I validation test," TD 923, ITU Study Group 9, 2008.
[8] A. Eichhorn and P. Ni, "Pick your layers wisely: a quality assessment of H.264 scalable video coding for mobile devices," in Proceedings of the 2009 IEEE International Conference on Communications, 2009, pp. 5446-5451.
[9] H. Knoche, J. McCarthy, and M. Sasse, "Can small be beautiful?: Assessing image resolution requirements for mobile TV," in Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005, pp. 829-838.
[10] S. Jumisko-Pyykko and J. Hakkinen, "Evaluation of subjective video quality of mobile devices," in Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005, pp. 535-538.
[11] M. Ries, O. Nemethova, and M. Rupp, "Performance evaluation of mobile video quality estimators," in Proceedings of the European Signal Processing Conference, Poznan, Poland, 2007.
[12] S. Jumisko-Pyykko and M. Hannuksela, "Does context matter in quality evaluation of mobile television?" in Proceedings of the 10th International Conference on Human Computer Interaction with Mobile Devices and Services, 2008, pp. 63-72.
[13] S. Winkler and F. Dufaux, "Video quality evaluation for mobile applications," in Proceedings of the SPIE Conference on Visual Communications and Image Processing, Lugano, Switzerland, vol. 5150, 2003, pp. 593-603.
[14] A. K. Moorthy, K. Seshadrinathan, R. Soundararajan, and A. C. Bovik, "Wireless video quality assessment: A study of subjective scores and objective algorithms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 4, pp. 513-516, April 2010.
[15] B. Wandell, Foundations of Vision. Sinauer Associates, 1995.
[16] ITU-R BT.500-11, "Methodology for the subjective assessment of the quality of television pictures," International Telecommunication Union.
[17] M. H. Pinson and S. Wolf, "Comparing subjective video quality testing methodologies," in Visual Communications and Image Processing, SPIE, vol. 5150, 2003.
[18] A. K. Moorthy, L.-K. Choi, A. C. Bovik, and G. de Veciana, "Video quality assessment on mobile devices: Subjective, behavioral and objective studies," IEEE Journal of Selected Topics in Signal Processing, Special Issue on New Subjective and Objective Methodologies for Audio and Visual Signal Processing, 2011 (submitted).