Perceptual Quality Prediction on Authentically Distorted Images Using a Bag of Features Approach


Journal of Vision (2016)

Deepti Ghadiyaram, Department of Computer Science, University of Texas at Austin, Austin, TX, USA
Alan C. Bovik, Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA

Current top-performing blind perceptual image quality prediction models are generally trained on legacy databases of human quality opinion scores on synthetically distorted images. Therefore, they learn image features that effectively predict human visual quality judgments of inauthentic, and usually isolated (single), distortions. However, real-world images usually contain complex, composite mixtures of multiple distortions. We study the perceptually relevant natural scene statistics of such authentically distorted images, in different color spaces and transform domains. We propose a bag of feature-maps approach that avoids assumptions about the type of distortion(s) contained in an image, focusing instead on capturing consistencies, or departures therefrom, of the statistics of real-world images. Using a large database of authentically distorted images, human opinions of them, and bags of features computed on them, we train a regressor to conduct image quality prediction. We demonstrate the competence of the features toward improving automatic perceptual quality prediction by testing a learned algorithm using them on a benchmark legacy database as well as on a newly introduced distortion-realistic resource called the LIVE In the Wild Image Quality Challenge Database. We extensively evaluate the perceptual quality prediction model and algorithm and show that it achieves quality prediction power better than that of other leading models.

Keywords: Perceptual image quality, natural scene statistics, blind image quality assessment, color image quality assessment.
Introduction

Objective blind or no-reference (NR) image quality assessment (IQA) is a fundamental problem of vision science with significant implications for a wide variety of image engineering applications. The goal of an NR IQA algorithm is the following: given an image (possibly distorted) and no other additional information, automatically and accurately predict its level of visual quality as would be reported by an average human subject. Given the tremendous surge in visual media content crossing the Internet and the ubiquitous availability of portable image capture (mobile) devices, an increasingly knowledgeable base of consumer users is demanding better-quality image and video acquisition and display services. The desire to control and monitor the quality of images produced has encouraged the rapid development of NR IQA algorithms, which can be used to monitor wired and wireless multimedia services, where reference images are unavailable. They can also be used to improve the perceptual quality of visual signals by employing quality-centric processing, or to improve the perceptual quality of acquired visual signals by perceptually optimizing the capture process. Such

quality-aware strategies could help deliver the highest possible quality picture content to camera users.

Authentic vs. Inauthentic Distortions

Current IQA models have been designed, trained, and evaluated on benchmark human opinion databases such as the LIVE Image Quality Database (Sheikh, Sabir, & Bovik, 2006), the TID Databases (Ponomarenko et al., 2009; Ponomarenko et al., 2013), the CSIQ Database (Larson & Chandler, 2010), and a few other small databases (Callet & Autrusseau, 2005). All of these databases were developed beginning with a small set of high-quality pristine images (29 distinct image contents in (Sheikh et al., 2006) and 25 in (Ponomarenko et al., 2009, 2013)), which were subsequently distorted. The distortions were introduced in a controlled manner by the database architects, and these databases share three key properties. First, the distortion severities / parameter settings were carefully (but artificially) selected, typically for psychometric reasons, such as mandating a wide range of distortions, or dictating an observed degree of perceptual separation between images distorted by the same process. Second, the distortions were synthesized from idealized distortion models. Third, the pristine images are of very high quality, and are usually distorted by one of several single distortions. These databases therefore contain images that have been impaired by one of a few synthetically introduced distortion types, at a level of perceptual distortion chosen by image quality scientists. Existing legacy image quality databases have played an important role in the advancement of the field of image quality prediction, especially in the design of both distortion-specific and general-purpose full-reference, reduced-reference, and no-reference image quality prediction algorithms. However, the images in these kinds of databases are generally inauthentically distorted.
Image distortions digitally created and manipulated by a database designer for the purpose of ensuring a statistically significant set of human responses are not the same as the real-world distortions that are introduced during image capture by the many and diverse types of cameras found in the hands of real-world users. We refer to the latter kind of images as authentically distorted. An important characteristic of real-world, authentically distorted images captured by naïve users of consumer camera devices is that they generally cannot be accurately described by a simple generative model, nor as suffering from single, statistically separable distortions. For example, a picture captured using a mobile camera under low-light conditions is likely to be under-exposed, in addition to being afflicted by low-light noise and blur. Subsequent processes of saving and/or transmitting the picture over a wireless channel may introduce further compression and transmission artifacts. Further, the characteristics of the overall distortion load of an image will depend on the device used for capture and on the camera-handling behavior of the user, which may induce further nonlinear shake and blur distortions. Consumer-grade digital cameras differ widely in their lens configurations, levels of noise sensitivity and acquisition speed, and in post-acquisition in-camera processing. Camera users differ in their shot selection preferences, hand steadiness, and situational awareness. Overall, our understanding of true, authentic image distortions is quite murky. Such complex, unpredictable, and currently unmodeled mixtures of distortions are characteristic of real-world pictures that are authentically distorted.
There is currently no known way to categorize, characterize, or model such complex and uncontrolled distortion mixtures, and it is certainly unreasonable to expect an image quality scientist to be able to excogitate a protocol for creating authentically distorted images in the laboratory by synthetically combining controlled, programmed distortions. There is, however, a way to create databases of authentically distorted images: by acquiring images taken by many casual camera users. Normally, inexpert camera users acquire pictures under highly varied and often suboptimal illumination conditions, with unsteady hands, and with unpredictable behavior on the part of the photographic subjects. Such real-world, authentically distorted images exhibit a broad spectrum of authentic quality types, mixtures, and distortion severities that defy attempts at accurate modeling or precise description. Authentic mixtures of distortions are even more difficult to model when they interact, creating new agglomerated distortions not resembling any of the constituent distortions. A simple example would be a noisy image that is heavily compressed, where the noise presence heavily affects the quantization process at high frequencies, yielding hard-to-describe, visible compressed noise artifacts. Users of mobile cameras will be familiar with this kind of spatially-varying, hard-to-describe distortion amalgamation. With an overarching goal to design an efficient blind IQA model that operates on images afflicted by real distortions, we created a challenging blind image quality database called the LIVE In the Wild Image Quality Challenge Database. This new database contains images that were captured using a large number of highly diverse individual mobile devices, including tablets and smartphones, to acquire typical real scenes in the U.S. and Korea. These images are affected by unknown mixtures of generally occurring, multiple, interacting authentic distortions of diverse severities (Ghadiyaram & Bovik, Nov 2014, 2016). A byproduct of the characteristically different natures of the authentically distorted images contained in the LIVE Challenge Database is that the statistical assumptions made in the past regarding distorted images do not hold. For example, statistics-based natural scene models, which are highly regular descriptors of natural high-quality images, are commonly modified to account for distortions using generalized statistical models. We have found that these models lose their power to discriminate high-quality images from distorted images when the distortions are authentic (Figure 2). This is of great consequence for the burgeoning field of blind IQA model development: the quality-aware statistical features used by top-performing no-reference (blind) image quality prediction models such as BRISQUE (Mittal, Moorthy, & Bovik, 2012), BLIINDS (Saad, Bovik, & Charrier, 2012), DIIVINE (Moorthy & Bovik, 2011), and Tang et al. (Tang, Joshi, & Kapoor, 2011, 2014) are highly successful on the legacy IQA databases (Sheikh et al., 2006; Ponomarenko et al., 2009). However, as we will show, the reliability of these statistical features, and consequently the performances of these blind IQA models, suffer when applied to the authentically distorted images contained in the new LIVE Challenge Database.
We believe that further development of successful no-reference image and video quality models will greatly benefit from the development of authentic distortion databases, making for more meaningful and relevant performance analyses and comparisons of state-of-the-art IQA algorithms, as well as furthering efforts toward improving our understanding of the perception of picture distortions and building better and more robust IQA models. Here, we aim to produce as rich a set of perceptually relevant quality-aware features as might better enable the accurate prediction of subjective image quality judgments on images afflicted by complex, real-world distortions. Our ultimate goal is to design a more robust and generic quality predictor that performs well not only on the existing legacy IQA databases (such as the LIVE IQA Database (Sheikh et al., 2006) and the TID2013 Database (Ponomarenko et al., 2013)) containing images afflicted only by single, synthetic distortions, but also delivers superior quality prediction performance on real-world images in the wild, i.e., as encountered in consumer image capture devices.

Motivation behind using Natural Scene Statistics

Natural scene statistics: Current efficient NR IQA algorithms use natural scene statistics (NSS) models (Bovik, 2013), which describe the statistical regularities of undistorted images, to capture the statistical naturalness (or lack thereof) of a given image. It should be noted that natural images are not necessarily images of natural environments such as trees or skies. Any natural visible-light image that is captured by an optical camera and is not subjected to artificial processing on a computer is regarded here as a natural image, including photographs of man-made objects. NSS models rely on the fact that good-quality real-world photographic images (henceforth referred to as pristine images) that have been suitably normalized follow statistical laws. Current NR IQA models measure perturbations of these statistics to predict image distortions.
State-of-the-art NSS-based NR IQA models (Mittal, Moorthy, & Bovik, 2012; Saad et al., 2012; Moorthy & Bovik, 2010, 2011; Tang et al., 2011; Zhang, Moorthy, Chandler, & Bovik, 2014; Mittal, Soundararajan, & Bovik, 2012) exploit these statistical perturbations by first extracting image features in a normalized bandpass space, then learning a kernel function that maps these features to ground-truth subjective quality scores. To date, these feature representations have only been tested on images containing synthetically applied distortions, and they may not perform well when applied to real-world images afflicted by mixtures of authentic distortions (Table 2). Consider the images in Fig. 1, which were transformed by a bandpass debiasing and divisive normalization operation (Ruderman, 1994). This normalization process reduces spatial dependencies in natural images. The empirical probability density function (histogram) of the resulting normalized luminance coefficient (NLC) map of the pristine image in Fig. 1(a) is quite Gaussian-like (Fig. 2). We deployed a generalized Gaussian distribution (GGD) model and estimated its parameters: shape (α) and variance (σ²) (see below for more details). We found that the value of α for Fig. 1(a) is 2.09, in accordance with the Gaussian model of

Figure 1: (a) A pristine image from the legacy LIVE Image Quality Database (Sheikh et al., 2006). (b) JP2K compression distortion artificially added to (a). (c) White noise added to (a). (d) A blurry image also distorted with low-light noise, from the new LIVE In the Wild Image Quality Challenge Database (Ghadiyaram & Bovik, Nov 2014, 2016).

Figure 2: Histograms of the normalized luminance coefficients of the images in Figures 1(a)-(d). Notice how each single, unmixed distortion affects the statistics in a characteristic way, but when mixtures of authentic distortions afflict an image, the histogram resembles that of a pristine image. (Best viewed in color.)

the histogram of its NLC map. It should be noted that the family of generalized Gaussian distributions includes the normal distribution when α = 2 and the Laplacian distribution when α = 1. This property is not specific to Fig. 1(a), but is generally characteristic of all natural images. As first observed in (Ruderman, 1994), natural, undistorted images of quite general (well-lit) image content captured by any good-quality camera may be expected to exhibit this statistical regularity after processing by bandpass debiasing and divisive normalization operations. To further illustrate this well-studied regularity, we processed the 29 pristine images from the legacy LIVE IQA Database (Sheikh et al., 2006), which vary greatly in their image content, and plotted the collective histogram of the normalized coefficients of all 29 images in Figure 3. Specifically, we concatenated the normalized coefficients of all the images into a single vector and plotted its histogram. The best-fitting GGD model yielded α = 2.05, which is again nearly Gaussian. The singular spike at zero almost invariably arises from regions of cloudless sky entirely bereft of objects. The same property is not held by the distorted images shown in Fig. 1(b) and (c).
The estimated shape parameter values computed on those images were .2 and 3.2, respectively. This deviation from Gaussianity of images containing single distortions has been observed and established in numerous studies on large, comprehensive datasets of distorted images, irrespective of the image content. Quantifying these kinds of statistical deviations, as learned from databases of annotated distorted images, is the underlying principle

Figure 3: Histogram of the normalized luminance coefficients of all 29 pristine images contained in the legacy LIVE IQA Database (Sheikh et al., 2006). Notice how, irrespective of the wide variety of image content of the 29 pristine images, their collective normalized coefficients follow a Gaussian distribution (estimated GGD shape parameter = 2.05).

behind several state-of-the-art objective blind IQA models (Mittal, Moorthy, & Bovik, 2012; Saad et al., 2012; Moorthy & Bovik, 2011; Tang et al., 2011; Zhang et al., 2014; Mittal, Soundararajan, & Bovik, 2012). While this anecdotal evidence suggests that the statistical deviations of distorted images may be reliably modeled, consider Fig. 1(d), from the new LIVE In the Wild Image Quality Challenge Database (Ghadiyaram & Bovik, Nov 2014, 2016). This image contains an apparent mixture of blur, sensor noise, illumination artifacts, and possibly other distortions, all nonlinear and difficult to model. Some distortion arises from compositions of these, which are harder to understand or model. The empirical distribution of its NLC (Fig. 2) also follows a Gaussian-like distribution, and the estimated shape parameter value (α) is 2.2, despite the presence of multiple severe and interacting distortions. As a way of visualizing this problem, we show scatter plots of subjective quality scores against the α values of the best GGD fits to the NLC maps of all the images (including the pristine images) in the legacy LIVE IQA Database (of synthetically distorted pictures) (Sheikh et al., 2006) in Fig. 4(a), and for all the authentically distorted images in the LIVE Challenge Database in Fig. 4(b). From Fig. 4(a), it can be seen that most of the images in the LIVE legacy IQA Database that have high human subjective quality scores (i.e., low Difference of Mean Opinion Scores (DMOS)) associated with them (including the pristine images) have estimated α values close to 2.0, while pictures having low quality scores (i.e., high DMOS) take different α values, and thus are statistically distinguishable from high-quality images. However, Fig. 4(b) shows that authentically distorted images from the new LIVE Challenge Database may be associated with α values close to 2.0, even heavily distorted pictures (i.e., with low Mean Opinion Scores (MOS)). Figure 5 plots the distribution of the fraction of all the images in each database that fall into four discrete MOS and DMOS categories. It should be noted that the legacy LIVE IQA Database provides DMOS scores, while the LIVE Challenge Database contains MOS scores. These histograms show that the distorted images span the entire quality range in both databases and that there is no noticeable skew of distortion severity in either database that could have affected the results in Fig. 4 and Fig. 6. Figure 6 also illustrates our observation that authentic and inauthentic distortions affect scene statistics differently. In the case of single inauthentic distortions, it may be observed that pristine and distorted images occupy different regions of this parameter space. For example, images with lower DMOS (higher quality) are more separated from the distorted image collection in this parameter space, making it easier to predict their quality. There is a great degree of overlap in the parameter space among images belonging to the categories DMOS <= 25 and DMOS > 25 and <= 50, while heavily distorted pictures belonging to the other two DMOS categories are separated in the parameter space.
On the other hand, all the images from the LIVE Challenge Database, which contain authentic, often agglomerated distortions, overlap to a great extent in this parameter space despite the wide spread of their quality distributions. Although the above visualizations in Figs. 4 and 6 were performed in a lower-dimensional space of parameters, it is possible that

Figure 4: 2D scatter plots of subjective quality scores against estimated shape parameters (α) obtained by fitting a generalized Gaussian distribution to the histograms of normalized luminance coefficients (NLC) of all the images in (a) the legacy LIVE Database (Sheikh et al., 2006), plotted against DMOS (lower values indicate better quality), and (b) the LIVE Challenge Database (Ghadiyaram & Bovik, Nov 2014, 2016), plotted against MOS (higher values indicate better quality).

authentically distorted images could exhibit better separation if modeled in a higher-dimensional space of perceptually relevant features. It is clear, however, that mixtures of authentic distortions may affect image statistics quite differently than single, synthetic distortions do. Figures 4 and 6 also suggest that although the distortion-informative image features used in several state-of-the-art IQA models are highly predictive of the perceived quality of the inauthentically distorted images contained in legacy databases (Sheikh et al., 2006; Ponomarenko et al., 2009) (Table 6), these features are insufficient to produce accurate predictions of quality on real-world, authentically distorted images (Table 2). These observations highlight the need to capture other, more diverse statistical image features toward improving the quality prediction power of blind IQA models on authentically distorted images.

Our Contributions and their Relation to Human Vision

To tackle the difficult problem of quality assessment of images in the wild, we sought to produce a large and comprehensive collection of quality-sensitive statistical image features drawn from among the most successful NR IQA models that have been produced to date (Xu, Lin, & Kuo, 2015).
However, going beyond this, and recognizing that even top-performing algorithms can lose their predictive power on real-world images afflicted by possibly multiple authentic distortions, we also designed a number of statistical features implied by existing models yet heretofore unused, in the hope that they might supply additional discriminative power on authentic image distortion ensembles. Even further, we deployed these models in a variety of color spaces representative of both chromatic image sensing and bandwidth-efficient, perceptually motivated color processing. This large collection of features, defined in various complementary, perceptually relevant color and transform-domain spaces, drives our feature-maps-based approach. Given the availability of a sizeable corpus of authentically distorted images with a very large database of associated human quality judgments, we saw the opportunity to conduct a meaningful, generalized comparative analysis of the quality prediction power of modern quality-aware statistical image features defined over diverse transform domains and color spaces. We thus conducted a discriminant analysis of an initial set of 56 features designed in different color spaces, which, when used to train a regressor, produced an NR IQA model delivering a high level of quality prediction power. We also conducted extensive experiments to validate the proposed model against other top-performing NR IQA models (a preliminary version of this work appeared in SPIE (Ghadiyaram & Bovik, 2015)), using both the standard benchmark dataset (Sheikh et al., 2006) as well as the new

Figure 5: Bar plots illustrating the distribution of the fraction of images from (left) the legacy LIVE IQA Database and (right) the LIVE Challenge Database belonging to four DMOS and MOS categories, respectively: <= 25, > 25 and <= 50, > 50 and <= 75, and > 75. These histograms demonstrate that the distorted images span the entire quality range in both databases.

LIVE In the Wild Image Quality Challenge Database (Ghadiyaram & Bovik, Nov 2014, 2016). We found that all prior state-of-the-art NR IQA algorithms perform rather poorly on the LIVE Challenge Database, while our perceptually motivated, feature-driven model yielded good prediction performance. We note that we could compare only with those algorithms whose code was publicly available. These results underscore the need for more representative quality-aware NSS features that are predictive of the perceptual severity of authentic image distortions.

Relation to human vision and perception: Neurons in area V1 of visual cortex perform scale-space orientation decompositions of visual data, leading to energy compaction (decorrelation and sparsification) of the data (Field, 1987). The feature maps that define our perceptual picture quality model are broadly designed to mimic the processing steps that occur at different stages along the early visual pipeline. Some of the feature maps used (specifically, neighboring pair products, debiased and normalized coefficients, and the Laplacian) have been previously demonstrated to possess powerful perceptual quality prediction capabilities on older, standard legacy picture quality databases. However, we have shown that they perform less effectively on realistic, complex hybrid distortions.
As such, we exploit other perceptually relevant features, including a heretofore unused detail ("sigma") feature drawn from current NSS models, as well as chromatic features expressed in various perceptually relevant luminance and opponent color spaces. Specifically, the novel feature maps designed in this work are the sigma map in different color channels, the red-green and blue-yellow color-opponent maps, the yellow color map, the difference of Gaussians of the sigma map in different color channels, and the chroma map in the LAB color space. Overall, our feature maps model luminance and chrominance processing in the retina, V1 simple cells, and V1 complex cells, via both oriented and non-oriented multiscale frequency decomposition and divisive contrast normalization processes operating on opponent color channels. The feature maps are also strongly motivated by recent NSS models (Mittal, Moorthy, & Bovik, 2012; Zhang et al., 2014; Ruderman, 1994; Srivastava, Lee, Simoncelli, & Zhu, 2003) of natural and distorted pictures that are dual to low-level perceptual processing models.

Distinction from other machine learning methods: Although using NSS models for IQA remains an active research area (Mittal, Moorthy, & Bovik, 2012; Zhang et al., 2014; Moorthy & Bovik, 2011; Goodall, Bovik, & Paulter, 2016; Zhang & Chandler, 2013), our bag of features approach goes significantly beyond these models through the use of a variety of heretofore unexploited, perceptually relevant statistical picture features. This places it in distinction to ad hoc machine-learning-driven computer vision models not founded on perceptual principles (Ye & Doermann, 2011; Tang et al., 2011; Kang, Ye, Li, & Doermann, 2014, 2015).
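To make these maps concrete, the sketch below computes a local sigma (detail) field, its difference of Gaussians, and simple red-green, blue-yellow, and yellow opponent maps from an RGB array. The particular opponent formulas (R−G, (R+G)/2−B, (R+G)/2) and the window widths are common textbook choices used here as hypothetical stand-ins; the paper's exact channel definitions and color-space transforms may differ, and the LAB chroma map (sqrt(a*² + b*²) after an RGB-to-LAB conversion) is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sigma_map(channel, win_sigma=7/6):
    """Local standard-deviation ('sigma') field of a single channel:
    the divisive-normalization denominator, used here as a feature map
    capturing local detail/contrast."""
    mu = gaussian_filter(channel, win_sigma)
    var = gaussian_filter(channel * channel, win_sigma) - mu * mu
    return np.sqrt(np.abs(var))

def dog_of_sigma(channel, s1=7/6, s2=7/3):
    """Difference of Gaussians applied to the sigma field: a bandpass
    view of local contrast variations."""
    s = sigma_map(channel)
    return gaussian_filter(s, s1) - gaussian_filter(s, s2)

def opponent_maps(rgb):
    """Simple color-opponent maps (illustrative textbook definitions)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = r - g                  # red-green opponency
    yellow = 0.5 * (r + g)      # crude "yellow" map
    by = yellow - b             # blue-yellow opponency
    return rg, by, yellow

rng = np.random.default_rng(1)
rgb = rng.uniform(0.0, 1.0, (64, 64, 3))   # stand-in for a real RGB image
rg, by, yellow = opponent_maps(rgb)
sig = sigma_map(rgb[..., 1])               # sigma map of the green channel
dog = dog_of_sigma(rgb[..., 1])
```

In the full model, scene statistics would then be extracted from each such map rather than from the luminance channel alone.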

Figure 6: 2D scatter plots of the estimated shape and (log) scale parameters obtained by fitting a generalized Gaussian distribution to the histograms of normalized luminance coefficients (NLC) of all the images in the legacy LIVE Database (Sheikh et al., 2006) and the LIVE Challenge Database (Ghadiyaram & Bovik, Nov 2014, 2016). Images are grouped into the categories DMOS (respectively MOS) <= 25, > 25 and <= 50, > 50 and <= 75, and > 75. Best viewed in color.

Related Work

Blind IQA models: The development of blind IQA models has been largely devoted to extracting low-level image descriptors that are independent of image content. Several models proposed in the past assume a particular kind of distortion and thus extract distortion-specific features (Ferzli & Karam, 2009; Narvekar & Karam, 2009; Varadarajan & Karam, 2008a, 2008b; Sheikh, Bovik, & Cormack, 2005; Chen et al., 2008; Barland & Saadane, 2006; Golestaneh & Chandler, 2014; Zhu & Karam, 2014). The development of NR IQA models based on natural scene statistics, which do not make a priori assumptions about the contained distortion, is also experiencing a surge of interest. Tang et al. (Tang et al., 2011) proposed an approach combining NSS features with a very large number of texture, blur, and noise statistic features. The DIIVINE Index (Moorthy & Bovik, 2011) deploys summary statistics under an NSS wavelet coefficient model. Another model, BLIINDS-II (Saad et al., 2012), extracts a small number of NSS features in the Discrete Cosine Transform (DCT) domain. BRISQUE (Mittal, Moorthy, & Bovik, 2012) trains an SVR on a small set of spatial NSS features. CORNIA (Ye & Doermann, 2011), which is not an NSS-based model, builds distortion-specific code words to compute image quality. NIQE (Mittal, Soundararajan, & Bovik, 2012) is an unsupervised NR IQA technique driven by spatial NSS-based features that requires no exposure to distorted images at all.
BIQES (Saha & Wu, 2015) is another recent training-free blind IQA model that uses a model of error visibility across image scales to conduct image quality prediction. The authors of (Kang et al., 2014) use a convolutional neural network (CNN): they divide an input image to be assessed into non-overlapping patches and, during training, assign each patch a quality score equal to its source image's ground-truth score. The CNN is trained on these locally normalized image patches and the associated quality scores. In the test phase, the average of the predicted patch quality scores is reported. This data augmentation and quality assignment strategy could be acceptable in their work (Kang et al., 2014), since their model is trained and tested on legacy benchmark datasets containing single homogeneous distortions (Sheikh et al., 2006; Ponomarenko et al., 2009). However, our method designs a quality predictor for nonhomogeneous, authentic distortions, where different types of distortions affect different parts of an image with varied severities. Thus, the CNN model, the quality assignment strategy in the training phase, and the predicted score pooling strategy in the test phase, as used in (Kang et al., 2014), cannot be directly extended to the images in the LIVE Challenge Database. Similarly, the authors of (Tang et al., 2014) use a deep belief network (DBN)

Figure 7: Sample images from the LIVE In the Wild Image Quality Challenge Database (Ghadiyaram & Bovik, Nov 2014, 2016). These images include pictures of faces, people, animals, close-up shots, wide-angle shots, nature scenes, man-made objects, images with distinct foreground/background configurations, and images without any notable object of interest.

combined with a Gaussian process regressor to train a model on the quality features proposed in their earlier work (Tang et al., 2011). All of these models (other than NIQE) were trained on synthetic, and usually singly, distorted images contained in existing benchmark databases (Sheikh et al., 2006; Ponomarenko et al., 2009). They are also evaluated on the same data, challenging their extensibility to images containing complex mixtures of authentic distortions such as those found in the LIVE Challenge Database (Ghadiyaram & Bovik, Nov 2014, 2016). Indeed, as we show in our experiments, all of the top-performing models perform poorly on the LIVE Challenge Database.

LIVE In the Wild Image Quality Challenge Database

We briefly describe the salient aspects of the new LIVE Challenge Database, which helps motivate this work. A much more comprehensive description of this significant effort is given in (Ghadiyaram & Bovik, Nov 2014, 2016). The new LIVE In the Wild Image Quality Challenge Database (Ghadiyaram & Bovik, Nov 2014) contains 1,163 images impaired by a wide variety of randomly occurring distortions and genuine capture artifacts, obtained using a wide variety of contemporary mobile camera devices, including smartphones and tablets. We gathered numerous images taken by many dozens of casual international users, containing diverse distortion types, mixtures, and severities. The images were collected without artificially introducing any distortions beyond those occurring during capture, processing, and storage.
Figure 7 depicts a small representative sample of the images in the LIVE Challenge Database. Since these images are authentically distorted, they usually contain mixtures of multiple impairments that defy categorization into distortion types. Such images are encountered in the real world and reflect a broad range of difficult-to-describe (or pigeonhole) composite image impairments. With the goal of gathering a large number of human opinion scores, we designed and implemented an online crowdsourcing system by leveraging Amazon's Mechanical Turk. We used our framework to gather more than 350,000 human ratings of image quality from more than 8,100 unique subjects, which amounts to about 175 ratings on each image in the new LIVE Challenge Database. This study is the world's largest, most comprehensive study of real-world perceptual image quality ever conducted.

Subject-Consistency Analysis: Despite the widely diverse study conditions, we observed very high consistency in users' sensitivity to distortions in images and in their ratings. To evaluate subject consistency, we split the ratings obtained on an image into two disjoint

equal sets, and computed two MOS values for every image, one from each set. When repeated over 25 random splits, the average linear (Pearson) correlation between the mean opinion scores of the two sets was found to be very high (0.9896). Also, when the MOS values obtained on a fixed set of images (5 gold standard images) via our online test framework were compared with the scores obtained on the same images in a traditional study setup, we achieved a very high correlation of 0.985. Both of these experiments highlight the high degree of reliability of the gathered subjective scores and of our test framework. We refer the reader to (Ghadiyaram & Bovik, 2016) for details on the content and design of the database, our crowdsourcing framework, and the very large scale subjective study we conducted on image quality. The database is freely available to the public.

Feature Maps Based Image Quality

Faced with the task of creating a model that can accurately predict the perceptual quality of real-world, authentically distorted images, an appealing solution would be to train a classifier or regressor using the distortion-sensitive statistical features that currently drive top-performing blind IQA models. However, as illustrated in Figs. 2-6, complex mixtures of authentic image distortions modify image statistics in ways not easily predicted by these models: they exhibit large, hard-to-predict statistical variations as compared to synthetically distorted images. Thus, we devised an approach that leverages the idea that different perceptual image representations may distinguish different aspects of the loss of perceived image quality. Specifically, given an image, we first construct several feature maps in multiple color spaces and transform domains, then extract individual and collective scene statistics from each of these maps.
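The approach just outlined (construct feature maps, extract scene statistics from each, feed them to a regressor) can be sketched end to end. The sketch below is a purely illustrative numpy-only toy: the maps, statistics, synthetic images, and fake opinion scores are all stand-ins, and an ordinary least-squares fit replaces the learned regressor used in the paper.

```python
import numpy as np

def sample_stats(x):
    """Sample statistics of a map's coefficients: mean, variance,
    skewness, and kurtosis (the kinds of features the paper extracts)."""
    x = np.asarray(x, dtype=float).ravel()
    mu, sd = x.mean(), x.std() + 1e-12
    z = (x - mu) / sd
    return [mu, sd ** 2, (z ** 3).mean(), (z ** 4).mean()]

def bag_of_features(img):
    """Toy bag of features: statistics of the raw map and of a crude
    bandpass (difference-from-local-mean) map, over two scales."""
    feats = []
    for _ in range(2):                                   # two scales
        feats += sample_stats(img)
        local_mean = (img + np.roll(img, 1, 0) + np.roll(img, 1, 1)) / 3.0
        feats += sample_stats(img - local_mean)          # crude bandpass map
        img = img[::2, ::2]                              # downsample for next scale
    return np.array(feats)

# Train a linear least-squares regressor (standing in for the learned
# regressor of the paper) on synthetic images with fake opinion scores.
rng = np.random.default_rng(0)
X = np.stack([bag_of_features(rng.standard_normal((64, 64))) for _ in range(50)])
y = rng.uniform(0.0, 100.0, size=50)                     # fake MOS targets
A = np.c_[X, np.ones(len(X))]                            # add an intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ w                                             # predicted quality scores
```

The real model differs in every particular (its maps, its parametric fits, its regressor), but the data flow is the same: each image becomes a fixed-length feature vector, and quality prediction reduces to regression on those vectors.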
Before we describe the types of feature maps that we compute, we first introduce the statistical modeling techniques that we employ to derive and extract features.

Statistical Modeling of Normalized Coefficients

Divisive Normalization: Wainwright et al. (Wainwright, Schwartz, & Simoncelli, 2002), building on Ruderman's work (Ruderman, 1994), empirically determined that bandpass natural images exhibit striking non-linear statistical dependencies. Applying a non-linear divisive normalization operation, similar to the non-linear response behavior of certain cortical neurons (Heeger, 1992), wherein rectified linear neuronal responses are divided by a weighted sum of rectified neighboring responses, greatly reduces these observed statistical dependencies and tends to gaussianize the processed picture data. For example, given an image's luminance map $L$ of size $M \times N$, a divisive normalization operation (Ruderman, 1994) yields a normalized luminance coefficients (NLC) map:

$$NLC(i, j) = \frac{L(i, j) - \mu(i, j)}{\sigma(i, j) + 1}, \qquad (1)$$

where

$$\mu(i, j) = \sum_{k=-3}^{3} \sum_{l=-3}^{3} w_{k,l}\, L(i - k, j - l) \qquad (2)$$

and

$$\sigma(i, j) = \sqrt{\sum_{k=-3}^{3} \sum_{l=-3}^{3} w_{k,l} \left[ L(i - k, j - l) - \mu(i - k, j - l) \right]^2}, \qquad (3)$$

where $i \in 1, 2, \dots, M$ and $j \in 1, 2, \dots, N$ are spatial indices and $w = \{w_{k,l} \mid k = -3, \dots, 3,\ l = -3, \dots, 3\}$ is a 2D circularly-symmetric Gaussian weighting function. Divisive normalization by neighboring coefficient energies in a wavelet or other bandpass transform domain similarly reduces statistical dependencies and gaussianizes the data. Divisive normalization, or contrast gain control (Wainwright et al., 2002), accounts for specific measured nonlinear interactions between neighboring neurons. It models the response of a neuron as governed by the

Figure 8: Given any image, our feature-maps-based model first constructs channel maps in different color spaces and then constructs several feature maps in multiple transform domains on each of these channel maps (only a few feature maps are illustrated here: luminance, chroma, normalized coefficients of Luma, the sigma field, the M and S color channels, the DoG of sigma, the yellow channel map, hue, saturation, the Laplacian, and the BY color-opponent map). Parametric scene statistic features are extracted from the feature maps after performing perceptually significant divisive normalization (Ruderman, 1994) on them. The design of each feature map is described in detail in later sections.

responses of a pool of neurons surrounding it. Further, divisive normalization models account for contrast masking phenomena (Sekuler & Blake, 2002), and hence are important ingredients in models of distorted image perception. Most of the feature maps we construct as part of extracting the proposed bag of features are processed using divisive normalization.

Generalized Gaussian Distribution: Our approach builds on the idea exemplified by observations like those depicted in Fig. 2, viz., that the normalized luminance or bandpass/wavelet coefficients of a given image have characteristic statistical properties that are predictably modified by the presence of distortions. Effectively quantifying these deviations is crucial to making predictions regarding the perceptual quality of images. A basic modeling tool that we use throughout is the generalized Gaussian distribution (GGD), which effectively models a broad spectrum of (singly) distorted image statistics, which are often characterized by changes in the tail behavior of the empirical coefficient distributions (Sharifi & Leon-Garcia, 1995). A GGD with zero mean is given by:

$$f(x; \alpha, \sigma^2) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)} \exp\left( -\left( \frac{|x|}{\beta} \right)^{\alpha} \right), \qquad (4)$$

where

$$\beta = \sigma \sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}} \qquad (5)$$

and $\Gamma(\cdot)$ is the gamma function:

$$\Gamma(a) = \int_{0}^{\infty} t^{a-1} e^{-t}\, dt, \quad a > 0. \qquad (6)$$

A GGD is characterized by two parameters: the parameter $\alpha$ controls the shape of the distribution and $\sigma^2$ controls its variance. A zero-mean distribution is appropriate for modeling NLC distributions since they are (generally) symmetric. These parameters are commonly estimated using an efficient moment-matching based approach (Sharifi & Leon-Garcia, 1995; Mittal, Moorthy, & Bovik, 2012).

Asymmetric Generalized Gaussian Distribution Model: Additionally, some of the normalized distributions derived from the feature maps are skewed, and are better modeled as following an asymmetric generalized Gaussian distribution (AGGD) (Lasmar,

Stitou, & Berthoumieu, 2009). An AGGD with zero mode is given by:

$$f(x; \nu, \sigma_l^2, \sigma_r^2) = \begin{cases} \dfrac{\nu}{(\beta_l + \beta_r)\,\Gamma(1/\nu)} \exp\left( -\left( \dfrac{-x}{\beta_l} \right)^{\nu} \right) & x < 0 \\[2ex] \dfrac{\nu}{(\beta_l + \beta_r)\,\Gamma(1/\nu)} \exp\left( -\left( \dfrac{x}{\beta_r} \right)^{\nu} \right) & x \geq 0, \end{cases} \qquad (7)$$

where

$$\beta_l = \sigma_l \sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}} \qquad (8)$$

and

$$\beta_r = \sigma_r \sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}}, \qquad (9)$$

and where $\eta$ is given by:

$$\eta = (\beta_r - \beta_l)\, \frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}. \qquad (10)$$

An AGGD is characterized by four parameters: the parameter $\nu$ controls the shape of the distribution, $\eta$ is the mean of the distribution, and $\sigma_l^2, \sigma_r^2$ are scale parameters that control the spread on the left and right sides of the mode, respectively. The AGGD further generalizes the GGD and subsumes it by allowing for asymmetry in the distribution. The skew of the distribution is a function of the left and right scale parameters: if $\sigma_l^2 = \sigma_r^2$, the AGGD reduces to a GGD. All the parameters of the AGGD may be efficiently estimated using the moment-matching-based approach proposed in (Lasmar et al., 2009).

Although pristine images produce normalized coefficients that reliably follow a Gaussian distribution, this behavior is altered by the presence of image distortions. The model parameters, such as the shape and variance of either a GGD or an AGGD fit to the NLC maps of distorted images, aptly capture this non-Gaussianity and hence are extensively utilized in our work. Additionally, sample statistics such as kurtosis, skewness, and goodness of the GGD fit have been empirically observed to be predictive of perceived image quality and are also considered here. Thus, we deploy either a GGD or an AGGD to fit the empirical NLC distributions computed on different feature maps of each image encountered in (Ghadiyaram & Bovik, Nov 2014, 2016). Images are naturally multiscale, and distortions affect image structures across scales.
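Since both the GGD and AGGD parameters are estimated by moment matching, a compact sketch of the GGD case may be helpful. This is a hypothetical re-implementation, not the authors' code: it exploits the fact that for a zero-mean GGD the moment ratio $E[x^2]/(E[|x|])^2$ equals $\Gamma(1/\alpha)\Gamma(3/\alpha)/\Gamma(2/\alpha)^2$, and inverts that relation over a grid of candidate shapes (a lookup-table variant of the approach in Sharifi & Leon-Garcia, 1995).

```python
import numpy as np
from math import gamma

def fit_ggd(x):
    """Moment-matching GGD fit: the sample ratio E[x^2] / (E[|x|])^2 is
    matched against its analytic value over a grid of shape parameters."""
    x = np.asarray(x, dtype=float).ravel()
    r = np.mean(x ** 2) / (np.mean(np.abs(x)) ** 2 + 1e-12)
    alphas = np.arange(0.2, 10.0, 0.001)
    rho = np.array([gamma(1 / a) * gamma(3 / a) / gamma(2 / a) ** 2
                    for a in alphas])
    alpha = alphas[np.argmin(np.abs(rho - r))]   # best-matching shape
    sigma2 = np.mean(x ** 2)                     # variance of zero-mean model
    return alpha, sigma2

# Sanity check: Gaussian samples are a GGD with shape alpha = 2.
x = np.random.default_rng(7).standard_normal(200_000)
alpha_hat, var_hat = fit_ggd(x)
```

The AGGD parameters are estimated analogously, using separate left- and right-side moments of the samples below and above the mode (Lasmar et al., 2009).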
Existing research on quality assessment has demonstrated that incorporating multiscale information when assessing quality produces QA algorithms that correlate better with human perception (Saad et al., 2012; Wang, Simoncelli, & Bovik, 2003). Hence, we extract these features from many of the feature maps at two scales: the original image scale, and a reduced resolution (low-pass filtered and downsampled by a factor of 2). It is possible that using more scales could be beneficial, but we did not find this to be the case on this large dataset, hence we only report scores using two scales.

Feature Maps

Our approach to feature map generation is decidedly a bag-of-features approach, as is highly popular in the development of a wide variety of computer vision algorithms that accomplish tasks such as object recognition (Grauman & Darrell, 2005; Csurka, Dance, Fan, Willamowski, & Bray, 2004). However, while our approach uses a large collection of highly heterogeneous features, as mentioned earlier, all of them either have a basis in current models of perceptual processing and/or perceptually relevant models of natural picture statistics, or are defined using perceptually plausible parametric or sample-statistic features computed on the empirical probability distributions (histograms) of simple biologically and/or statistically relevant image features. We also deploy these kinds of features on a diverse variety of color space representations. Currently, our understanding of color image distortions is quite limited. By using the bag-of-features approach on a variety of color representations, we aim to capture aspects of distortion perception that are possibly distributed over the different spaces. Figure 8 schematically describes some of the feature maps that are built into our model, while Fig. 9 shows the flow of statistical feature extraction from these feature maps.

Figure 9: Our proposed model processes a variety of perceptually relevant feature maps by modeling the distribution of their coefficients (divisively normalized in some cases) using either a GGD (in the real or complex domain), an AGGD, or a wrapped Cauchy distribution, and by extracting perceptually relevant statistical features that are used to train a quality predictor.

Further, we will use the images illustrated in Figure 10(a)-(d) in the sections below to illustrate the proposed feature maps and the statistical variations that occur in the presence of distortions.

Luminance Feature Maps

Next we describe the feature maps derived from the luminance component of any image considered.

a. Luminance Map: There is considerable evidence that local center-surround excitatory-inhibitory processes occur at several types of retinal neurons (Kuffler, 1953; Bovik, 2013), providing a bandpass response to the visual signal's luminance. It is common to also model the local divisive normalization of these non-oriented bandpass retinal responses, as in (Mittal, Moorthy, & Bovik, 2012). Thus, given an M × N × 3 image I in RGB color space, its luminance component is first extracted; we refer to it as the Luma map. A normalized luminance coefficient (NLC) map as defined in (1) is then computed by applying a divisive normalization operation to it (Ruderman, 1994). A slight variation from the usual retinal contrast signal model is the use of divisive normalization by the standard deviation (as defined in (3)) of the local responses rather than by the local mean response. The best-fitting GGD model to the empirical distribution of the NLC map is then found (Mittal, Moorthy, & Bovik, 2012).
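A direct (if unoptimized) sketch of computing the NLC map of Eqs. (1)-(3) follows. Two details are assumptions rather than values from the paper: the Gaussian window width (sigma = 7/6 on a 7×7 support, a common choice in MSCN implementations), and the use of the standard shortcut E[L²] − (E[L])² for the local variance instead of the per-neighbor means written in Eq. (3).

```python
import numpy as np

def gaussian_window(K=3, sigma=7.0 / 6.0):
    """2D circularly-symmetric Gaussian weights w_{k,l}, k,l in [-K, K],
    normalized to unit sum. sigma = 7/6 is an assumed (common) value."""
    k = np.arange(-K, K + 1)
    g = np.exp(-(k[:, None] ** 2 + k[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def nlc_map(L, C=1.0):
    """NLC = (L - mu) / (sigma + C), with mu and sigma the Gaussian-weighted
    local mean and standard deviation of the luminance map L (Eqs. (1)-(3))."""
    w = gaussian_window()
    pad = 3
    Lp = np.pad(np.asarray(L, dtype=float), pad, mode="reflect")
    M, N = L.shape
    mu = np.zeros((M, N))
    mu2 = np.zeros((M, N))
    for k in range(-3, 4):
        for l in range(-3, 4):
            shifted = Lp[pad + k: pad + k + M, pad + l: pad + l + N]
            mu += w[k + 3, l + 3] * shifted
            mu2 += w[k + 3, l + 3] * shifted ** 2
    sigma = np.sqrt(np.maximum(mu2 - mu ** 2, 0.0))  # local std (shortcut form)
    return (L - mu) / (sigma + C)

rng = np.random.default_rng(0)
nlc = nlc_map(rng.uniform(0.0, 255.0, size=(32, 32)))   # random texture
flat = nlc_map(np.full((32, 32), 128.0))                # constant image
```

A constant image normalizes to all zeros, and a textured image yields approximately zero-mean, decorrelated coefficients, which is what makes the GGD fit meaningful.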
Two parameters (α, σ²) are estimated and two sample statistics (kurtosis and skewness) are computed from the empirical distribution over two scales, yielding a total of 8 features. These may be regarded as essential NSS features related to classical models of retinal processing.

b. Neighboring Paired Products: The statistical relationships between neighboring pixels of an NLC map are captured by computing four product maps that serve as simple estimates of local correlation. These four maps are defined at each coordinate (i, j) by taking the product of NLC(i, j) with each of its directional neighbors NLC(i, j+1), NLC(i+1, j), NLC(i+1, j+1), and NLC(i+1, j−1). These maps have been shown to reliably obey an AGGD in the absence of distortion (Mittal, Moorthy, & Bovik, 2012). A total of 24 parameters is computed (4 AGGD parameters per product map plus two sample statistics, kurtosis and skewness). These features are computed at two scales, yielding 48 additional features. They use the same NSS/retinal model to account for local spatial correlations.

c. Sigma Map: The designers of existing NSS-based blind IQA models have largely ignored the predictive power of the sigma field (3) present in the classic Ruderman model. However, the sigma field of a pristine image also exhibits a regular structure that is disturbed by the presence of distortion. We extract the sample kurtosis, skewness, and arithmetic mean of the sigma field at two scales to efficiently capture structural anomalies that may arise from distortion. While this feature map has not been used before for visual modeling, it derives from the same NSS/retinal model and is statistically regular.

d. Difference of Gaussian (DoG) of Sigma Map: Center-surround processes are known to occur at various stages of visual processing, including the multi-scale receptive fields of retinal ganglion cells (Campbell & Robson, 1968). A good model is the 2D

Figure 10: (a) A high-quality image and (b)-(d) a few distorted images from the LIVE Challenge Database (Ghadiyaram & Bovik, Nov 2014, 2016).

difference of isotropic Gaussian filters (Wilson & Bergen, 1979; Rodieck, 1965):

$$DoG = \frac{1}{2\pi} \left( \frac{1}{\sigma_1}\, e^{-\frac{x^2 + y^2}{2\sigma_1^2}} - \frac{1}{\sigma_2}\, e^{-\frac{x^2 + y^2}{2\sigma_2^2}} \right), \qquad (11)$$

where $\sigma_2 = 1.5\sigma_1$. The value of $\sigma_1$ in our implementation was 1.6. The mean-subtracted and divisively normalized coefficients of the DoG of the sigma field (obtained by applying (3) to the DoG of the sigma field, denoted henceforth as DoG_sigma) of the luminance map of a pristine image exhibit a regular structure that deviates in the presence of some kinds of distortion (Fig. 11). Features that are useful for capturing a broad spectrum of distortion behavior include the estimated shape, standard deviation, sample skewness, and kurtosis. The DoG of the sigma field can highlight conspicuous, stand-out statistical features that may particularly affect the visibility of distortions. We next extract the sigma field of DoG_sigma and denote its mean-subtracted and divisively normalized coefficients as DoG'_sigma. The sigma field of DoG_sigma is obtained by applying (3) to DoG_sigma. We found that DoG'_sigma also exhibits statistical regularities that are disrupted by the presence of distortions (Fig. 11). The sample kurtosis and skewness of these normalized coefficients are part of the list of features that are fed to the regressor.

e. Laplacian of the Luminance Map: A Laplacian image is computed as the downsampled difference between an image and a low-pass filtered version of it. The Laplacian of the luminance map of a pristine image is well modeled as an AGGD, but this property is disrupted by image distortions (L. Zhang, Zhang, & Bovik, 2015). We therefore compute the Laplacian of each image's luminance map (Luma) and model it using an AGGD. This is also a bandpass retinal NSS model, but without normalization.
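Returning to the DoG of Eq. (11): a small sketch of sampling that filter on a discrete grid follows. The 15×15 support and the default σ₁ = 1.6 here are illustrative assumptions (the text above gives σ₂ = 1.5σ₁); the resulting kernel would be convolved with the sigma field to produce DoG_sigma.

```python
import numpy as np

def dog_kernel(sigma1=1.6, size=15):
    """Sample the difference-of-isotropic-Gaussians filter of Eq. (11),
    with sigma2 = 1.5 * sigma1, on a size x size grid centered at 0."""
    sigma2 = 1.5 * sigma1
    r = size // 2
    y, x = np.mgrid[-r: r + 1, -r: r + 1]
    g = lambda s: np.exp(-(x ** 2 + y ** 2) / (2 * s ** 2)) / s  # 1/s scaling, per Eq. (11)
    return (g(sigma1) - g(sigma2)) / (2 * np.pi)

k = dog_kernel()   # excitatory center, inhibitory surround
```

Note that with the 1/σ scaling of Eq. (11) the kernel does not integrate to zero, so in practice one might zero-mean the sampled kernel before filtering; that post-processing step is an implementation choice, not something stated in the text.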
The estimated model parameters (ν, σ_l², σ_r²) of this fit are used as features, along with this feature map's sample kurtosis and skewness.

f. Features extracted in the wavelet domain: The next set of feature maps is extracted from a complex steerable pyramid wavelet transform of an image's luminance map. This could also be accomplished using Gabor filters (Clark & Bovik, 1989), but the steerable pyramid has been deployed quite successfully in the past on NSS-based problems (Sheikh et al., 2006; Moorthy & Bovik, 2011; Wainwright et al., 2002; Sheikh et al., 2005). The features drawn from this decomposition are strongly multiscale and multi-orientation, unlike the other features. C-DIIVINE (Zhang et al., 2014) is a complex extension of the NSS-based DIIVINE IQA model (Moorthy

Figure 11: Histograms of the normalized coefficients of (a) DoG_sigma and (b) DoG'_sigma of the luminance components of Figures 10(a)-(d).

& Bovik, 2011), which uses a complex steerable pyramid. Features computed from it enable changes in local magnitude and phase statistics induced by distortions to be effectively captured. One of the underlying parametric probability models used by C-DIIVINE is the wrapped Cauchy distribution. Given an image whose quality needs to be assessed, 82 statistical C-DIIVINE features are extracted from its luminance map using 3 scales and 6 orientations. These features are also used by the learner.

Chroma Feature Maps

Feature maps are also defined on the Chroma map, defined in the perceptually relevant CIELAB color space of one luminance (L*) and two chrominance (a* and b*) components (Rajashekar, Wang, & Simoncelli, 2010). The coordinate L* of the CIELAB space represents color lightness, a* is its position relative to red/magenta and green, and b* is its position relative to yellow and blue. Moreover, the nonlinear relationships between L*, a*, and b* mimic the nonlinear responses of the L, M, and S cone cells in the retina and are designed to uniformly quantify perceptual color differences. Chroma, on the other hand, captures the perceived intensity of a specific color, and is defined as follows:

$$C^*_{ab} = \sqrt{a^{*2} + b^{*2}}, \qquad (12)$$

where a* and b* refer to the two chrominance components of any given image in the CIELAB color space. The chrominance channels contained in the Chroma map are entropy-reduced representations similar to the responses of color-differencing retinal ganglion cells.

g. Chroma Map: The mean-subtracted and divisively normalized coefficients of the Chroma map (12) of a pristine image follow a Gaussian-like distribution, which is perturbed by the presence of distortions (Fig. 12)
and thus a GGD model is apt to capture these statistical deviations. We extract two model parameters (shape and standard deviation) and two sample statistics (kurtosis and skewness) at two scales to serve as image features.

h. Sigma Field of the Chroma Map: We next compute a sigma map (as defined in (3)) of Chroma (henceforth referred to as Chroma_sigma). The mean-subtracted and divisively normalized coefficients of Chroma_sigma of pristine images also obey a unit Gaussian-like distribution, which is violated in the presence of distortions (Fig. 12). We again use a GGD to model these statistical deviations, estimate the model parameters (shape and standard deviation), and compute the sample kurtosis and skewness at two scales. All of these are used as features deployed by the learner. Furthermore, as was done on the luminance component's sigma field in the section above, we compute the sample mean, kurtosis, and skewness of Chroma_sigma. We also process the normalized coefficients of the Chroma map and generate four neighboring pair


More information

Color Local Texture Features Based Face Recognition

Color Local Texture Features Based Face Recognition Color Local Texture Features Based Face Recognition Priyanka V. Bankar Department of Electronics and Communication Engineering SKN Sinhgad College of Engineering, Korti, Pandharpur, Maharashtra, India

More information

A Probabilistic Quality Representation Approach to Deep Blind Image Quality Prediction

A Probabilistic Quality Representation Approach to Deep Blind Image Quality Prediction A Probabilistic Quality Representation Approach to Deep Blind Image Quality Prediction Hui Zeng, Lei Zhang, Senior Member, IEEE Alan C. Bovik, Fellow, IEEE. arxiv:708.0890v2 [cs.cv] 5 Dec 207 Abstract

More information

Artifacts and Textured Region Detection

Artifacts and Textured Region Detection Artifacts and Textured Region Detection 1 Vishal Bangard ECE 738 - Spring 2003 I. INTRODUCTION A lot of transformations, when applied to images, lead to the development of various artifacts in them. In

More information

Blind Image Quality Assessment: From Natural Scene Statistics to Perceptual Quality

Blind Image Quality Assessment: From Natural Scene Statistics to Perceptual Quality Blind Image Quality Assessment: From Natural Scene Statistics to Perceptual Quality Anush Krishna Moorthy, Alan Conrad Bovik, Fellow, IEEE Abstract Our approach to blind image quality assessment (IQA)

More information

Novel Lossy Compression Algorithms with Stacked Autoencoders

Novel Lossy Compression Algorithms with Stacked Autoencoders Novel Lossy Compression Algorithms with Stacked Autoencoders Anand Atreya and Daniel O Shea {aatreya, djoshea}@stanford.edu 11 December 2009 1. Introduction 1.1. Lossy compression Lossy compression is

More information

Efficient Color Image Quality Assessment Using Gradient Magnitude Similarity Deviation

Efficient Color Image Quality Assessment Using Gradient Magnitude Similarity Deviation IJECT Vo l. 8, Is s u e 3, Ju l y - Se p t 2017 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Efficient Color Image Quality Assessment Using Gradient Magnitude Similarity Deviation 1 Preeti Rani,

More information

Real-time No-Reference Image Quality Assessment based on Filter Learning

Real-time No-Reference Image Quality Assessment based on Filter Learning 2013 IEEE Conference on Computer Vision and Pattern Recognition Real-time No-Reference Image Quality Assessment based on Filter Learning Peng Ye, Jayant Kumar, Le Kang, David Doermann Institute for Advanced

More information

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images 1 Introduction - Steve Chuang and Eric Shan - Determining object orientation in images is a well-established topic

More information

NO-REFERENCE IMAGE QUALITY ASSESSMENT ALGORITHM FOR CONTRAST-DISTORTED IMAGES BASED ON LOCAL STATISTICS FEATURES

NO-REFERENCE IMAGE QUALITY ASSESSMENT ALGORITHM FOR CONTRAST-DISTORTED IMAGES BASED ON LOCAL STATISTICS FEATURES NO-REFERENCE IMAGE QUALITY ASSESSMENT ALGORITHM FOR CONTRAST-DISTORTED IMAGES BASED ON LOCAL STATISTICS FEATURES Ismail T. Ahmed 1, 2 and Chen Soong Der 3 1 College of Computer Science and Information

More information

IMPROVING THE PERFORMANCE OF NO-REFERENCE IMAGE QUALITY ASSESSMENT ALGORITHM FOR CONTRAST-DISTORTED IMAGES USING NATURAL SCENE STATISTICS

IMPROVING THE PERFORMANCE OF NO-REFERENCE IMAGE QUALITY ASSESSMENT ALGORITHM FOR CONTRAST-DISTORTED IMAGES USING NATURAL SCENE STATISTICS 201 Jordanian Journal of Computers and Information Technology (JJCIT), Vol. 3, No. 3, December 2017. IMPROVING THE PERFORMANCE OF NO-REFERENCE IMAGE QUALITY ASSESSMENT ALGORITHM FOR CONTRAST-DISTORTED

More information

Objective image quality assessment

Objective image quality assessment Hot Topics in Multimedia Research Objective image quality assessment (IQA) aims to develop computational models that can measure image quality consistently in subjective evaluations. With the now widespread

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

SVD FILTER BASED MULTISCALE APPROACH FOR IMAGE QUALITY ASSESSMENT. Ashirbani Saha, Gaurav Bhatnagar and Q.M. Jonathan Wu

SVD FILTER BASED MULTISCALE APPROACH FOR IMAGE QUALITY ASSESSMENT. Ashirbani Saha, Gaurav Bhatnagar and Q.M. Jonathan Wu 2012 IEEE International Conference on Multimedia and Expo Workshops SVD FILTER BASED MULTISCALE APPROACH FOR IMAGE QUALITY ASSESSMENT Ashirbani Saha, Gaurav Bhatnagar and Q.M. Jonathan Wu Department of

More information

MULTI-SCALE STRUCTURAL SIMILARITY FOR IMAGE QUALITY ASSESSMENT. (Invited Paper)

MULTI-SCALE STRUCTURAL SIMILARITY FOR IMAGE QUALITY ASSESSMENT. (Invited Paper) MULTI-SCALE STRUCTURAL SIMILARITY FOR IMAGE QUALITY ASSESSMENT Zhou Wang 1, Eero P. Simoncelli 1 and Alan C. Bovik 2 (Invited Paper) 1 Center for Neural Sci. and Courant Inst. of Math. Sci., New York Univ.,

More information

Blind Image Quality Assessment Through Wakeby Statistics Model

Blind Image Quality Assessment Through Wakeby Statistics Model Blind Image Quality Assessment Through Wakeby Statistics Model Mohsen Jenadeleh (&) and Mohsen Ebrahimi Moghaddam Faculty of Computer Science and Engineering, Shahid Beheshti University G.C, Tehran, Iran

More information

Automatically Improving 3D Neuron Segmentations for Expansion Microscopy Connectomics. by Albert Gerovitch

Automatically Improving 3D Neuron Segmentations for Expansion Microscopy Connectomics. by Albert Gerovitch Automatically Improving 3D Neuron Segmentations for Expansion Microscopy Connectomics by Albert Gerovitch 1 Abstract Understanding the geometry of neurons and their connections is key to comprehending

More information

Statistical image models

Statistical image models Chapter 4 Statistical image models 4. Introduction 4.. Visual worlds Figure 4. shows images that belong to different visual worlds. The first world (fig. 4..a) is the world of white noise. It is the world

More information

Color Content Based Image Classification

Color Content Based Image Classification Color Content Based Image Classification Szabolcs Sergyán Budapest Tech sergyan.szabolcs@nik.bmf.hu Abstract: In content based image retrieval systems the most efficient and simple searches are the color

More information

A COMPARATIVE STUDY OF QUALITY AND CONTENT-BASED SPATIAL POOLING STRATEGIES IN IMAGE QUALITY ASSESSMENT. Dogancan Temel and Ghassan AlRegib

A COMPARATIVE STUDY OF QUALITY AND CONTENT-BASED SPATIAL POOLING STRATEGIES IN IMAGE QUALITY ASSESSMENT. Dogancan Temel and Ghassan AlRegib A COMPARATIVE STUDY OF QUALITY AND CONTENT-BASED SPATIAL POOLING STRATEGIES IN IMAGE QUALITY ASSESSMENT Dogancan Temel and Ghassan AlRegib Center for Signal and Information Processing (CSIP) School of

More information

Detecting motion by means of 2D and 3D information

Detecting motion by means of 2D and 3D information Detecting motion by means of 2D and 3D information Federico Tombari Stefano Mattoccia Luigi Di Stefano Fabio Tonelli Department of Electronics Computer Science and Systems (DEIS) Viale Risorgimento 2,

More information

Aggregating Descriptors with Local Gaussian Metrics

Aggregating Descriptors with Local Gaussian Metrics Aggregating Descriptors with Local Gaussian Metrics Hideki Nakayama Grad. School of Information Science and Technology The University of Tokyo Tokyo, JAPAN nakayama@ci.i.u-tokyo.ac.jp Abstract Recently,

More information

TODAY S technology permits video content to be ubiquitously

TODAY S technology permits video content to be ubiquitously 1352 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 3, MARCH 2014 Blind Prediction of Natural Video Quality Michele A. Saad, Alan C. Bovik, Fellow, IEEE, and Christophe Charrier, Member, IEEE Abstract

More information

Evaluation of Two Principal Approaches to Objective Image Quality Assessment

Evaluation of Two Principal Approaches to Objective Image Quality Assessment Evaluation of Two Principal Approaches to Objective Image Quality Assessment Martin Čadík, Pavel Slavík Department of Computer Science and Engineering Faculty of Electrical Engineering, Czech Technical

More information

Latest development in image feature representation and extraction

Latest development in image feature representation and extraction International Journal of Advanced Research and Development ISSN: 2455-4030, Impact Factor: RJIF 5.24 www.advancedjournal.com Volume 2; Issue 1; January 2017; Page No. 05-09 Latest development in image

More information

Referenceless Perceptual Fog Density Prediction Model

Referenceless Perceptual Fog Density Prediction Model Referenceless Perceptual Fog Density Prediction Model Lark Kwon Choi* a, Jaehee You b, and Alan C. Bovik a a Laboratory for Image and Video Engineering (LIVE), Department of Electrical and Computer Engineering,

More information

Automatic Colorization of Grayscale Images

Automatic Colorization of Grayscale Images Automatic Colorization of Grayscale Images Austin Sousa Rasoul Kabirzadeh Patrick Blaes Department of Electrical Engineering, Stanford University 1 Introduction ere exists a wealth of photographic images,

More information

OBJECTIVE IMAGE QUALITY ASSESSMENT WITH SINGULAR VALUE DECOMPOSITION. Manish Narwaria and Weisi Lin

OBJECTIVE IMAGE QUALITY ASSESSMENT WITH SINGULAR VALUE DECOMPOSITION. Manish Narwaria and Weisi Lin OBJECTIVE IMAGE UALITY ASSESSMENT WITH SINGULAR VALUE DECOMPOSITION Manish Narwaria and Weisi Lin School of Computer Engineering, Nanyang Technological University, Singapore, 639798 Email: {mani008, wslin}@ntu.edu.sg

More information

New Directions in Image and Video Quality Assessment

New Directions in Image and Video Quality Assessment New Directions in Image and Video Quality Assessment Al Bovik Laboratory for Image & Video Engineering (LIVE) The University of Texas at Austin bovik@ece.utexas.edu October 2, 2007 Prologue I seek analogies

More information

Locating ego-centers in depth for hippocampal place cells

Locating ego-centers in depth for hippocampal place cells 204 5th Joint Symposium on Neural Computation Proceedings UCSD (1998) Locating ego-centers in depth for hippocampal place cells Kechen Zhang,' Terrence J. Sejeowski112 & Bruce L. ~cnau~hton~ 'Howard Hughes

More information

Storage Efficient NL-Means Burst Denoising for Programmable Cameras

Storage Efficient NL-Means Burst Denoising for Programmable Cameras Storage Efficient NL-Means Burst Denoising for Programmable Cameras Brendan Duncan Stanford University brendand@stanford.edu Miroslav Kukla Stanford University mkukla@stanford.edu Abstract An effective

More information

Evaluation of texture features for image segmentation

Evaluation of texture features for image segmentation RIT Scholar Works Articles 9-14-2001 Evaluation of texture features for image segmentation Navid Serrano Jiebo Luo Andreas Savakis Follow this and additional works at: http://scholarworks.rit.edu/article

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU

FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU 1. Introduction Face detection of human beings has garnered a lot of interest and research in recent years. There are quite a few relatively

More information

DETECTION OF SMOOTH TEXTURE IN FACIAL IMAGES FOR THE EVALUATION OF UNNATURAL CONTRAST ENHANCEMENT

DETECTION OF SMOOTH TEXTURE IN FACIAL IMAGES FOR THE EVALUATION OF UNNATURAL CONTRAST ENHANCEMENT DETECTION OF SMOOTH TEXTURE IN FACIAL IMAGES FOR THE EVALUATION OF UNNATURAL CONTRAST ENHANCEMENT 1 NUR HALILAH BINTI ISMAIL, 2 SOONG-DER CHEN 1, 2 Department of Graphics and Multimedia, College of Information

More information

Texture Image Segmentation using FCM

Texture Image Segmentation using FCM Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M

More information

IMAGE DENOISING TO ESTIMATE THE GRADIENT HISTOGRAM PRESERVATION USING VARIOUS ALGORITHMS

IMAGE DENOISING TO ESTIMATE THE GRADIENT HISTOGRAM PRESERVATION USING VARIOUS ALGORITHMS IMAGE DENOISING TO ESTIMATE THE GRADIENT HISTOGRAM PRESERVATION USING VARIOUS ALGORITHMS P.Mahalakshmi 1, J.Muthulakshmi 2, S.Kannadhasan 3 1,2 U.G Student, 3 Assistant Professor, Department of Electronics

More information

No-reference stereoscopic image-quality metric accounting for left and right similarity map and spatial structure degradation

No-reference stereoscopic image-quality metric accounting for left and right similarity map and spatial structure degradation Loughborough University Institutional Repository No-reference stereoscopic image-quality metric accounting for left and right similarity map and spatial structure degradation This item was submitted to

More information

Filtering, scale, orientation, localization, and texture. Nuno Vasconcelos ECE Department, UCSD (with thanks to David Forsyth)

Filtering, scale, orientation, localization, and texture. Nuno Vasconcelos ECE Department, UCSD (with thanks to David Forsyth) Filtering, scale, orientation, localization, and texture Nuno Vasconcelos ECE Department, UCSD (with thanks to David Forsyth) Beyond edges we have talked a lot about edges while they are important, it

More information

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong)

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) References: [1] http://homepages.inf.ed.ac.uk/rbf/hipr2/index.htm [2] http://www.cs.wisc.edu/~dyer/cs540/notes/vision.html

More information

No-reference visually significant blocking artifact metric for natural scene images

No-reference visually significant blocking artifact metric for natural scene images No-reference visually significant blocking artifact metric for natural scene images By: Shan Suthaharan S. Suthaharan (2009), No-reference visually significant blocking artifact metric for natural scene

More information

Blind Quality Prediction of Stereoscopic 3D Images

Blind Quality Prediction of Stereoscopic 3D Images Blind Quality Prediction of Stereoscopic 3D Images Jiheng Wang, Qingbo Wu, Abdul Rehman, Shiqi Wang, and Zhou Wang Dept. of Electrical & Computer Engineering, University of Waterloo; Waterloo, ON, Canada

More information

Chapter 7. Conclusions and Future Work

Chapter 7. Conclusions and Future Work Chapter 7 Conclusions and Future Work In this dissertation, we have presented a new way of analyzing a basic building block in computer graphics rendering algorithms the computational interaction between

More information

F-MAD: A Feature-Based Extension of the Most Apparent Distortion Algorithm for Image Quality Assessment

F-MAD: A Feature-Based Extension of the Most Apparent Distortion Algorithm for Image Quality Assessment F-MAD: A Feature-Based Etension of the Most Apparent Distortion Algorithm for Image Quality Assessment Punit Singh and Damon M. Chandler Laboratory of Computational Perception and Image Quality, School

More information

Structural Similarity Based Image Quality Assessment

Structural Similarity Based Image Quality Assessment Structural Similarity Based Image Quality Assessment Zhou Wang, Alan C. Bovik and Hamid R. Sheikh It is widely believed that the statistical properties of the natural visual environment play a fundamental

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

International Journal of Computer Engineering and Applications, Volume XII, Issue I, Jan. 18, ISSN

International Journal of Computer Engineering and Applications, Volume XII, Issue I, Jan. 18,   ISSN International Journal of Computer Engineering and Applications, Volume XII, Issue I, Jan. 18, www.ijcea.com ISSN 2321-3469 SURVEY ON OBJECT TRACKING IN REAL TIME EMBEDDED SYSTEM USING IMAGE PROCESSING

More information

Detecting Digital Image Forgeries By Multi-illuminant Estimators

Detecting Digital Image Forgeries By Multi-illuminant Estimators Research Paper Volume 2 Issue 8 April 2015 International Journal of Informative & Futuristic Research ISSN (Online): 2347-1697 Detecting Digital Image Forgeries By Multi-illuminant Estimators Paper ID

More information

CHAPTER 3. Preprocessing and Feature Extraction. Techniques

CHAPTER 3. Preprocessing and Feature Extraction. Techniques CHAPTER 3 Preprocessing and Feature Extraction Techniques CHAPTER 3 Preprocessing and Feature Extraction Techniques 3.1 Need for Preprocessing and Feature Extraction schemes for Pattern Recognition and

More information

DCT-BASED IMAGE QUALITY ASSESSMENT FOR MOBILE SYSTEM. Jeoong Sung Park and Tokunbo Ogunfunmi

DCT-BASED IMAGE QUALITY ASSESSMENT FOR MOBILE SYSTEM. Jeoong Sung Park and Tokunbo Ogunfunmi DCT-BASED IMAGE QUALITY ASSESSMENT FOR MOBILE SYSTEM Jeoong Sung Park and Tokunbo Ogunfunmi Department of Electrical Engineering Santa Clara University Santa Clara, CA 9553, USA Email: jeoongsung@gmail.com

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Supplementary Figure 1. Decoding results broken down for different ROIs

Supplementary Figure 1. Decoding results broken down for different ROIs Supplementary Figure 1 Decoding results broken down for different ROIs Decoding results for areas V1, V2, V3, and V1 V3 combined. (a) Decoded and presented orientations are strongly correlated in areas

More information

Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation

Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation M. Blauth, E. Kraft, F. Hirschenberger, M. Böhm Fraunhofer Institute for Industrial Mathematics, Fraunhofer-Platz 1,

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Dynamic Range and Weber s Law HVS is capable of operating over an enormous dynamic range, However, sensitivity is far from uniform over this range Example:

More information

Parametric Texture Model based on Joint Statistics

Parametric Texture Model based on Joint Statistics Parametric Texture Model based on Joint Statistics Gowtham Bellala, Kumar Sricharan, Jayanth Srinivasa Department of Electrical Engineering, University of Michigan, Ann Arbor 1. INTRODUCTION Texture images

More information

The SIFT (Scale Invariant Feature

The SIFT (Scale Invariant Feature The SIFT (Scale Invariant Feature Transform) Detector and Descriptor developed by David Lowe University of British Columbia Initial paper ICCV 1999 Newer journal paper IJCV 2004 Review: Matt Brown s Canonical

More information

Automatic Shadow Removal by Illuminance in HSV Color Space

Automatic Shadow Removal by Illuminance in HSV Color Space Computer Science and Information Technology 3(3): 70-75, 2015 DOI: 10.13189/csit.2015.030303 http://www.hrpub.org Automatic Shadow Removal by Illuminance in HSV Color Space Wenbo Huang 1, KyoungYeon Kim

More information

Short Survey on Static Hand Gesture Recognition

Short Survey on Static Hand Gesture Recognition Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of

More information

Exploring Bag of Words Architectures in the Facial Expression Domain

Exploring Bag of Words Architectures in the Facial Expression Domain Exploring Bag of Words Architectures in the Facial Expression Domain Karan Sikka, Tingfan Wu, Josh Susskind, and Marian Bartlett Machine Perception Laboratory, University of California San Diego {ksikka,ting,josh,marni}@mplab.ucsd.edu

More information

MAXIMIZING BANDWIDTH EFFICIENCY

MAXIMIZING BANDWIDTH EFFICIENCY MAXIMIZING BANDWIDTH EFFICIENCY Benefits of Mezzanine Encoding Rev PA1 Ericsson AB 2016 1 (19) 1 Motivation 1.1 Consumption of Available Bandwidth Pressure on available fiber bandwidth continues to outpace

More information

5. Feature Extraction from Images

5. Feature Extraction from Images 5. Feature Extraction from Images Aim of this Chapter: Learn the Basic Feature Extraction Methods for Images Main features: Color Texture Edges Wie funktioniert ein Mustererkennungssystem Test Data x i

More information

Edge Detection (with a sidelight introduction to linear, associative operators). Images

Edge Detection (with a sidelight introduction to linear, associative operators). Images Images (we will, eventually, come back to imaging geometry. But, now that we know how images come from the world, we will examine operations on images). Edge Detection (with a sidelight introduction to

More information

Face Recognition using Eigenfaces SMAI Course Project

Face Recognition using Eigenfaces SMAI Course Project Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 satarupa.guha@research.iiit.ac.in Ayushi Dalmia IIIT Hyderabad 201307565 ayushi.dalmia@research.iiit.ac.in Abstract

More information

Normalized Texture Motifs and Their Application to Statistical Object Modeling

Normalized Texture Motifs and Their Application to Statistical Object Modeling Normalized Texture Motifs and Their Application to Statistical Obect Modeling S. D. Newsam B. S. Manunath Center for Applied Scientific Computing Electrical and Computer Engineering Lawrence Livermore

More information