Robust and Efficient Saliency Modeling from Image Co-Occurrence Histograms


Shijian Lu, Cheston Tan, and Joo-Hwee Lim, Member, IEEE

Abstract — This paper presents a visual saliency modeling technique that is efficient and tolerant to image scale variation. Unlike existing approaches that rely on a large number of filters or complicated learning processes, the proposed technique computes saliency from image histograms. Several two-dimensional image co-occurrence histograms are used, which encode not only how many (occurrence) but also where and how (co-occurrence) image pixels are composed into a visual image, hence capturing the unusualness of an object or image region that is often perceived through either global uncommonness (i.e., low occurrence frequency) or local discontinuity with respect to the surroundings (i.e., low co-occurrence frequency). The proposed technique has a number of advantageous characteristics: it is fast and very easy to implement, involves minimal parameter tuning, requires no training, and is robust to image scale variation. Experiments on the AIM dataset show that the proposed technique obtains a superior shuffled AUC (sAUC), higher than that of state-of-the-art models.

Index Terms — Saliency modeling, visual attention, image co-occurrence histogram

1 INTRODUCTION

THE human visual system is overwhelmed by a tremendous amount of visual information that it cannot process completely [1]. Visual saliency, which reflects how much an image region or object stands out from its surroundings, provides a mechanism for prioritizing visual processing [2], [3]. Computational modeling of visual saliency aims to build a saliency map of the corresponding image or scene, and has a wide range of applications including image/video compression, visual search, object recognition, and so on. It has drawn even more research interest in recent years thanks to advances in eye-tracking devices, with which human fixations can be recorded while a subject freely views a scene or image.

A number of saliency models have been reported in recent years. The reported models can be classified based on whether learning is involved. Among models that require no learning, the Itti and Koch model [4], [5] is one of the earliest; it computes saliency as the difference between filter responses within each color channel at each image scale. Several models exploit the complexity of image regions as captured by image entropy [6], image difference [7], or local and global contrast [8], [9]. Some systems reported in the past few years introduce certain high-level factors, for example face detectors, into the saliency model [10], [11], [12]. Other models exploit context [11], [13], spectral residual [14], [15], image segmentation [16], [17], [18], image color [19], graph topography [20], self-resemblance [21], [22], [23], discriminative spatial information [24], amplitude spectrum modulation [25], adaptive whitening [26], and so on.

(The authors are with the Institute for Infocomm Research, A*STAR, 1 Fusionopolis Way, #21-01 Connexis (South Tower), Singapore. E-mail: {slu, cheston-tan, joohwee}@i2r.a-star.edu.sg. Manuscript received 19 Sept. 2012; revised 5 May 2013; accepted 2 Aug. 2013; published online 15 Aug. 2013. Recommended for acceptance by M. Brown.)
Learning-based models compute saliency using the statistics of image features or filter outputs, which are learned either from the image under study or from a set of natural images. Bruce and Tsotsos [27], [28], [29] proposed to first build a sparse representation of image filter statistics through independent component analysis. The saliency of a new image is then computed through self-information maximization, where filter statistics are learned from the image under study within an independent component space. The model in [30] similarly computes saliency based on image features and filter statistics under a Bayesian framework; it differs from [27], [28] in that the filter statistics are learned from a separate set of natural images. Other models learn from saliency features [31], fixational eye-tracking data [10], [32], [33], [34], and so on.

Existing saliency models have several limitations. First, many models make use of either local features [6], [7], [16] or global features [8], [14], [19] alone, whereas saliency modeling should take both into consideration [8], [35]. Take the image complexity/difference-based models [6], [7] as an example: high image complexity or difference can have little correlation with high saliency. For instance, a small and homogeneous image region will have high saliency when most other image regions have dynamic but regular texture. Second, many reported models are slow because they either involve a large number of image filters [5], [27], [30] or compute saliency at multiple image scales [11], [20], whereas saliency computation, as a preprocessing step, needs to be performed as fast as possible. Third, some reported models [4], [16], [27] involve many parameters and are not easy to implement.

Here, we present a saliency model based on the image co-occurrence histogram (ICH), which has previously been used for object recognition [36], [37], [38]. The ICH concurrently encodes both the global occurrence of pixel values and the local co-occurrence of pixel pairs within a neighborhood window. Visual saliency, which is often perceived through global uncommonness and local discontinuity with respect to the surroundings, can therefore be determined from the low-frequency pixel occurrence and co-occurrence information. The ICH-based saliency model has several advantageous characteristics. First, it is fast and has potential for use in real-time applications. Second, it requires minimal parameter tuning and is very easy to implement. Third, it is robust and tolerant to image scale variation. Last but not least, it captures both local and global saliency information and demonstrates superior accuracy in predicting human fixations.

Our preliminary work on saliency modeling based on image histograms was reported in [39]. The saliency model in this paper differs in several aspects. First, redundant information from the one-dimensional (1D) histogram is discarded because it increases computation cost but has little effect on the saliency modeling performance. Second, the occurrence/co-occurrence of image gradient orientation is incorporated because it is closely correlated with perceptual visual saliency and clearly improves the saliency modeling performance. Third, more experiments have been conducted.
In particular, one specific application to object detection and segmentation has been studied and preliminary results are presented.

The rest of this paper is organized as follows: Section 2 discusses the ICH construction and some ICH properties. Section 3 describes how to compute saliency from an ICH. Section 4 then discusses how a saliency map is built from the saliency of different channel images. Section 5 evaluates the proposed saliency model on two public datasets. Finally, concluding remarks are drawn in Section 6.

2 IMAGE CO-OCCURRENCE HISTOGRAMS

The image histogram represents the distribution of image values. It has a number of good properties, for example robustness to image noise, rotation, scale variation, and so on. However, the traditional 1D image histogram captures only the occurrence of image values, whereas information about the local spatial composition of image pixels is completely discarded (information that is in fact very important to the perception of an image).

Fig. 1. (a) A sample image from the AIM dataset [28]. (b) The ICH of the color channel b (in the Lab color space) of the sample image shown in Fig. 1a. (c) Two sample rows of the ICH shown in Fig. 1b.

We show that a two-dimensional (2D) ICH is capable of capturing both the occurrence and the co-occurrence of image pixels and can be used to calculate a good measure of visual saliency. For ease of description and illustration, we use a single-channel image to explain the process of ICH construction and saliency computation in this section and Section 3. The construction of a final saliency map by combining the saliency of multiple channel images is described in Section 4.

Consider a single-channel integer image I. Let IK = {1, 2, ..., k} be the set of k possible image values within I (k is 256 for an 8-bit integer image). H, the ICH of the image I, is defined as

  H = \{ h(m, n) \}, \quad m, n \in IK,    (1)

where H is a symmetric square matrix of size k x k. An ICH element h(m, n) is the co-occurrence count of image values m and n within a square neighborhood window of size z. H is constructed as follows. For each image pixel with a value of m, all image pixels within the local neighborhood window are examined one by one; if a neighboring pixel has a value of n, the ICH element h(m, n) is increased by one. The ICH is complete once all image pixels within I have been examined in this way.

The ICH captures both the occurrence and the co-occurrence of image values as defined in (1). In particular, each image value pairs with itself, accounting for many diagonal elements of the ICH, through which global occurrence information is captured. At the same time, each pixel also pairs with a number of neighboring pixels, accounting for the off-diagonal elements of the ICH, through which local co-occurrence information is captured. The neighborhood size z can be set between 1 and 4 without affecting the performance much, as discussed in Section 5. Note that some ICH elements may be zero if the corresponding image value pairs never occur within the same local neighborhood anywhere in the image.

For the sample image in Fig. 1a, Fig. 1b shows the ICH of its b channel image in the Lab color space when the neighborhood size z is set to 6 (the ICH is normalized to sum to 1 as specified in (2)). Fig. 1c shows two sample rows of the ICH in Fig. 1b. As shown in Figs. 1b and 1c, most pixel value pairs are captured by ICH elements around the diagonal. This is reasonable because each pixel pairs with itself and the values of most neighboring pixel pairs are the same or close to each other, especially for images with a high level of homogeneity. In contrast, high-contrast pixel value pairs are captured by ICH elements far from the diagonal, and the value of the ICH elements drops quickly as the distance from the diagonal increases.
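As a concrete illustration of the construction procedure above, the following is a minimal NumPy sketch (the function and variable names are ours, not from the paper), assuming an 8-bit single-channel image:

```python
import numpy as np

def build_ich(img, z=2, k=256):
    """Count co-occurrences of pixel values within a (2z+1) x (2z+1)
    neighborhood window, per (1); each pixel also pairs with itself,
    so the diagonal carries the global occurrence counts."""
    rows, cols = img.shape
    H = np.zeros((k, k), dtype=np.float64)
    for i in range(rows):
        for j in range(cols):
            m = img[i, j]
            for di in range(-z, z + 1):
                for dj in range(-z, z + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:
                        H[m, img[ii, jj]] += 1  # pair (m, n) observed
    return H
```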
Fig. 2. The PMF (normalized ICH) is tolerant to changes in image scale and neighborhood size z: For the b channel of the image in Fig. 1a, (a) and (b) show two sample rows of four PMFs, constructed when the image is at the original versus half the original scale and when z is set to 3 versus 6, respectively.

The ICH has two nice properties: it is tolerant to changes in the image scale and in the neighborhood size z. This is illustrated in Fig. 2, where Figs. 2a and 2b show two sample rows of four (normalized) ICHs constructed when z is set to 3 versus 6 pixels and when the image is at the original versus half the original scale, respectively. As shown in Figs. 2a and 2b, the ICHs are similar to each other even when z and the image scale are very different. The main difference is that when z is smaller or the image is at a higher resolution, the ICH becomes more condensed, with a higher peak at the diagonal and a sharper falloff with distance from the diagonal. This is reasonable because when z is smaller (or, similarly, when the image scale is larger), only closer neighboring pixels are considered and they are more likely to have similar image values. These two ICH properties explain why the resulting saliency is tolerant to image scale variation and why it involves minimal parameter tuning (z is the only model parameter).

3 SALIENCY MODELING FROM ICHS

Image saliency is typically captured by two types of ICH elements. The first are those far from the diagonal, which correspond to high-contrast pixel pairs such as those lying near the boundary of the red bell pepper shown in Fig. 1a. The second are those lying around the diagonal, which correspond to pixel pairs with similar (but relatively infrequent) image values, such as those lying within the red bell pepper.

With the ICH H described in the previous section, a probability mass function (PMF) P can be computed as follows:

  P = \frac{H}{\sum_{m=1}^{k} \sum_{n=1}^{k} h(m, n)},    (2)

where h(m, n) denotes the element of H at (m, n) as defined in (1). As shown in Figs. 1b and 1c, P actually captures the divergence of an image's statistical distribution from a uniform distribution. Since saliency is usually negatively correlated with occurrence/co-occurrence, an inverted PMF P̄ is computed as follows:

  \bar{p}(m, n) = \begin{cases} 0, & \text{if } p(m, n) = 0, \\ 0, & \text{if } p(m, n) > U, \\ U - p(m, n), & \text{if } 0 < p(m, n) \le U, \end{cases}    (3)

where p(m, n) denotes an element of P. As defined in (3), elements of P̄ are set to 0 when there are no corresponding pixel value pairs within the image or when the corresponding elements of P are larger than a certain threshold (i.e., they are common and therefore inconspicuous). The threshold U in (3) denotes a uniform distribution, set to the average of the nonzero elements within P:

  U = \frac{1}{\sum \mathrm{INZ}(P)},    (4)

where INZ(P) denotes a binary nonzero function that sets all nonzero elements within P to 1 and the rest to 0. The denominator is therefore the number of nonzero elements within P (since the elements of P sum to 1, their nonzero average is exactly one over this count). U is defined based on the rationale that any image values or image value pairs that are more common than average should not be considered salient.
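The steps in (2)-(4) reduce to a few array operations. A minimal sketch, assuming the ICH H from the previous section (names are ours):

```python
import numpy as np

def inverted_pmf(H):
    """Sketch of (2)-(4): normalize the ICH into a PMF, then invert it so
    that rare (co-)occurrences receive high values and common ones zero."""
    P = H / H.sum()                      # eq. (2): PMF
    nonzero = P > 0
    U = 1.0 / nonzero.sum()              # eq. (4): average of nonzero elements
    P_bar = np.where(nonzero & (P <= U), U - P, 0.0)  # eq. (3)
    return P_bar, U
```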

Saliency can then be computed from P̄. For each image pixel at location (i, j), the corresponding image saliency S(i, j) is computed as follows:

  S(i, j) = \sum_{i'=i-z}^{i+z} \sum_{j'=j-z}^{j+z} \bar{p}\big(x(i, j),\, x(i', j')\big),    (5)

where z denotes the size of the neighborhood window, the same as used for the ICH construction described in Section 2. The notations x(i, j) and x(i', j') denote the image values at locations (i, j) and (i', j'), respectively, and p̄(x(i, j), x(i', j')) is therefore the element of P̄ indexed by x(i, j) and x(i', j').

Fig. 3. P̄ captures local discontinuity and global uncommonness: For the two synthetic images in (a) and (b), the computed saliency after Gaussian smoothing is shown in (c) and (d).

P̄ captures both local discontinuity and global uncommonness information. Fig. 3 illustrates this using the two synthetic images shown in Figs. 3a and 3b. The gray areas within the two images share the same gray value and so have the same contrast (discontinuity) to the black and white circles and the surrounding backgrounds (with values of 0 and 255, respectively). Figs. 3c and 3d show the computed saliency maps. As they show, P̄ captures local discontinuity: the boundaries of all circles and squares are at least somewhat salient. More importantly, P̄ captures the occurrence of co-occurrence information: co-occurrence patterns with a lower occurrence frequency receive higher saliency. Take the first synthetic image (Fig. 3a) as an example. Black-gray co-occurrence is much rarer than white-gray co-occurrence. As a result, much higher saliency is determined around the black circle boundary than around the boundaries of the white circle and the gray rectangles.

4 SALIENCY MAP CONSTRUCTION

We compute saliency within the Lab color space, which takes advantage of human color-response characteristics. The Lab color space has three image channels: L encodes image lightness, and a and b encode image color. For each of the three channel images, image values are first linearly mapped to [0, 255] (shifted by subtracting the minimum image value, divided by the maximum of the shifted values, and multiplied by 255) and then rounded to integers for ICH construction and saliency computation.

Besides image lightness and image color, image gradient orientation is also incorporated because it is often closely correlated with perceptual visual saliency. For each image pixel, the image gradient orientation is first quantized into 180 bins (i.e., 1 to 180 degrees). An orientation co-occurrence histogram H_o of size 180 x 180 can then be constructed in the same way as H in (1), and the corresponding PMF P_o can be constructed as specified in (2)-(4). The orientation-related saliency S_o can thus be computed in the same way as S in (5):

  S_o(i, j) = \sum_{i'=i-z}^{i+z} \sum_{j'=j-z}^{j+z} \bar{p}_o\big(x(i, j),\, x(i', j')\big),    (6)

where p̄_o(x(i, j), x(i', j')) denotes the element of P̄_o indexed by the orientation values x(i, j) and x(i', j').
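A sketch of the per-channel saliency computation in (5) (and, given a quantized orientation image and P̄_o, equally of (6)); clipping the window at the image border is our assumption, as the paper does not specify the border handling:

```python
import numpy as np

def channel_saliency(img, P_bar, z=2):
    """Per-pixel saliency per (5): sum inverted-PMF values over all pixel
    pairs formed by the center pixel and its (2z+1) x (2z+1) neighborhood."""
    rows, cols = img.shape
    S = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            i0, i1 = max(0, i - z), min(rows, i + z + 1)
            j0, j1 = max(0, j - z), min(cols, j + z + 1)
            # window is clipped at the image border -- our assumption
            S[i, j] = P_bar[img[i, j], img[i0:i1, j0:j1]].sum()
    return S
```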

It should be noted that not all image pixels are included in the construction of H_o and the ensuing computation of S_o. In particular, image pixels whose gradient magnitude is smaller than the mean gradient of the whole image are excluded, because they usually lie within inconspicuous image regions.

A saliency map is finally computed by combining the saliency computed in the image value domain and in the image gradient orientation domain:

  S = G(S_v + S_o),    (7)

where G(·) denotes a standard Gaussian smoothing function, widely used to convert computed visual saliency into a saliency map [5]. S_v denotes the saliency of image values as defined in (5) and S_o denotes the saliency of image gradient orientations as defined in (6); both are determined by combining the saliency computed over the L, a, and b image channels as follows:

  S_v = \sum_{n=1}^{3} S_{v,n}, \qquad S_o = \sum_{n=1}^{3} S_{o,n},    (8)

where S_{v,n} and S_{o,n}, n = 1, ..., 3, denote the saliency of the nth channel image computed within the image value domain and the image gradient orientation domain, respectively, as specified in (5) and (6). S_{v,n} and S_{o,n} in (8) are first normalized to 0-1 before they are combined within the image value domain and the image gradient orientation domain, respectively. The normalization alleviates the situation where one channel image with very high saliency dominates the combined overall saliency. On the other hand, our study shows that the normalization does not much improve the prediction accuracy of the proposed model when evaluated over a dataset with multiple images; similar (slightly lower) accuracy can be obtained by summing the channel saliency S_{v,n} and S_{o,n} (as well as S_v and S_o in (7)) directly without normalization.

Fig. 4. Saliency map construction: For the image in (a), (b) and (c) show the image saliency S_v and S_o computed based on the corresponding ICH and image orientation co-occurrence histogram, respectively. (d) and (e) show the combined saliency before and after Gaussian smoothing, respectively. (f) shows the corresponding fixational map.

Fig. 4 illustrates the proposed saliency modeling technique. For the sample image in Fig. 4a, Figs. 4b and 4c show the saliency S_v and S_o computed within the image value domain and the image orientation domain, respectively, as specified in (8). Figs. 4d and 4e show the combined saliency before and after Gaussian smoothing, respectively, as specified in (7). As Figs. 4d and 4e show, the proposed saliency model accurately predicts the human fixations shown in Fig. 4f.
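A sketch of the combination in (7)-(8), assuming one saliency map per channel and domain has already been computed; the smoothing bandwidth sigma is a placeholder, since the paper specifies the smoothing window as a fraction of the image width rather than a fixed sigma:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def combine_saliency(maps_v, maps_o, sigma=8.0):
    """Combine per-channel saliency per (8), then smooth per (7)."""
    def norm01(S):
        span = S.max() - S.min()
        return (S - S.min()) / span if span > 0 else S
    S_v = sum(norm01(S) for S in maps_v)  # value domain, eq. (8)
    S_o = sum(norm01(S) for S in maps_o)  # orientation domain, eq. (8)
    # sigma is a placeholder; the paper sets the smoothing window
    # relative to the image width
    return gaussian_filter(S_v + S_o, sigma)  # eq. (7)
```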
5 EXPERIMENTAL RESULTS

This section presents experimental results, including the dataset description, qualitative illustrations, quantitative results, and discussion.

5.1 Data Sets

We evaluate the proposed saliency model using the AIM dataset described in [28] and the SR dataset described in [14]. The AIM dataset is created from eye-tracking experiments performed while participants freely view 120 static images. For each image, the fixation points of 20 subjects are collected and a fixational map is determined through Gaussian smoothing of the collected fixation points, as illustrated in the second row of Fig. 5. The SR dataset consists of 62 static images; for each image, salient image regions are first manually labeled by four subjects and then averaged to form a hit map, as illustrated in the second column of Fig. 6.

5.2 Qualitative Results

Fig. 5. Comparison of the proposed saliency model with state-of-the-art models: For the sample images from the AIM dataset shown in the first row, rows 2 and 3 show the corresponding fixational maps as described in Section 5.1 and the saliency maps by our proposed model. Rows 4-10 show the corresponding saliency maps computed by the CCH model [39], context model [11], signature model [15], AWS model [26], frequency-tuned model [19], SUN model [30], and AIM model [28], respectively.

We first qualitatively compare our model with seven state-of-the-art models: the context model [11], signature model [15], frequency-tuned (FT) model [19], AWS model [26], AIM model [28], SUN model [30], and CCH model [39] (the implementations of the state-of-the-art models are downloaded from the authors' websites). Fig. 5 shows several images from the AIM dataset and the corresponding saliency maps.

TABLE 1. sAUC and Speed of the Proposed Model and the Seven Comparison Models over the AIM Dataset [28].

Fig. 6. ICH-based saliency varies little with the neighborhood size z: For the first image in each row, the second column shows the corresponding hit map and the third to sixth columns show the ICH-based saliency maps when z is set to 2, 4, 6, and 8, respectively.

For each image in the first row of Fig. 5, the images in the second and third rows show the corresponding fixational maps and the ICH-based saliency maps (z set to 2), respectively. The images in rows 4-10 show the saliency maps computed by the models in [39], [11], [15], [26], [19], [30], [28] (as listed in Table 1), respectively. Note that for the illustration in Fig. 5, a smoothing window of 0.04 of the image width is used for all evaluated models (multiple rounds of smoothing are implemented for the quantitative study, to be discussed later).

The histogram-based saliency has several advantageous characteristics, as illustrated in Fig. 5. First, it is more discriminative and predicts the human fixations more accurately than the state-of-the-art models. In particular, the saliency maps by the two learning-based models [30], [28] in rows 9 and 10 are blurred, with unfixated image regions also appearing relatively salient. This could be due to the learned saliency features, some of which exist within both salient and nonsalient image regions. In addition, many state-of-the-art models [19], [26], [28], [30] are somewhat sensitive to complex texture, such as the trees in the sixth image in Fig. 5, as illustrated by the relatively high saliency in the top-right portions of the saliency maps in the rightmost column of the last four rows.

Histogram-based saliency also involves minimal parameter tuning: the only parameter is the neighborhood size z, and varying z has little effect on the image co-occurrence histograms and hence on the computed image saliency. This is illustrated in Fig. 6, where for the first image from the SR dataset [14] in each row, the graphs in column 2 show the corresponding hit map and those in columns 3-6 show the histogram-based saliency maps when z is set to 2, 4, 6, and 8 pixels, respectively. As shown in Fig. 6, the histogram-based saliency maps are close to each other even when z is set to very different values. A quantitative experiment has also been conducted on the SR dataset; it shows that the optimal shuffled area under the receiver operating characteristic (ROC) curve (sAUC) is obtained when the neighborhood size z is 4 pixels and images are at 0.35 of the original image scale.

5.3 Quantitative Results

Quantitative experiments have also been conducted on the AIM dataset, where performance is measured through analysis of the receiver operating characteristic. For each image in the AIM dataset, multiple thresholds are first selected to convert the saliency map and the fixational map into multiple pairs of binary maps. True positives (TP) and false positives (FP) are then determined, and an ROC curve and the corresponding shuffled area under the ROC curve are computed. We follow the ROC computation procedure in [40] to compensate for the center bias that commonly exists within human fixations and often affects the performance evaluation [30], [40].
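For reference, a common formulation of the shuffled AUC scores the saliency at an image's own fixations against the saliency at fixation locations drawn from other images, which discounts the center bias. The sketch below follows this usual procedure [30], [40] and is not necessarily the authors' exact implementation (names are ours):

```python
import numpy as np

def shuffled_auc(sal, fix_rc, other_fix_rc):
    """sAUC sketch: positives are saliency values at this image's fixations
    (row, col pairs); negatives are values at fixation locations borrowed
    from other images, which compensates for center bias."""
    pos = sal[fix_rc[:, 0], fix_rc[:, 1]]
    neg = sal[other_fix_rc[:, 0], other_fix_rc[:, 1]]
    # sweep thresholds over all observed scores to trace the ROC curve
    thresholds = np.unique(np.concatenate([pos, neg]))[::-1]
    tpr = np.array([np.mean(pos >= t) for t in thresholds])
    fpr = np.array([np.mean(neg >= t) for t in thresholds])
    tpr = np.concatenate([[0.0], tpr, [1.0]])
    fpr = np.concatenate([[0.0], fpr, [1.0]])
    # trapezoidal area under the ROC curve
    return float(np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2.0))
```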
Table 1 shows the optimal sAUC of the proposed model and the seven compared models, where the optimal sAUC is the maximum sAUC when images are resized to 10 different image scales, as shown in Fig. 7a; at each image scale, 25 rounds of Gaussian smoothing are implemented by changing the smoothing window size from 0.01 to 0.13 of the image width in steps of 0.005, as described in [15]. As shown in Table 1, the proposed model obtains a higher sAUC than all seven compared models. In particular, the sAUC of the proposed model is higher than that of our earlier histogram-based model in [39]; the improvement is largely due to the incorporation of image gradient orientation information as described in Section 4. The AWS model [26] obtains a slightly lower sAUC but is clearly slower than the proposed model.

Table 1 also shows the average execution time (over the AIM dataset) of all evaluated models, tested on the same desktop PC. Differently from the original implementations, the image scale is consistently set to 0.5 of the original scale for all evaluated models for fair comparison. As shown in Table 1, the execution time of the proposed model is around 0.34 seconds, close to the models in [15], [39], [19] but significantly faster than the other four [26], [28], [11], [30]. The speed advantage is due to the histogram operations, which involve only simple computations, whereas many reported models involve a large number of filters of different dimensions, for example, 25 filters of 1,323 dimensions in [28] and 362 filters of 363 dimensions in [30]. Note that saliency in [11] is computed and averaged over four different image scales.

Fig. 7a shows the sAUC of the proposed model and six of the seven comparison models when the image is resized from 1.0 to 0.1 of the original scale (only a single sAUC value is obtained for the model in [11], where saliency is computed and averaged over four image scales in the authors' implementation). As shown in Fig. 7a, the sAUC of the models in [15], [19], [28], [30] changes greatly with variation of the image scale. In particular, the sAUC of [15] increases as the image scale decreases and image details such as inconspicuous edges are suppressed. On the other hand, the sAUC of [28] decreases greatly when the image scale decreases. The large decrease could be explained by the fact that the model in [28] learns filter statistics from the image under study (instead of from many other images [30]). The model in [19] obtains a low sAUC because it was designed to extract salient objects from images with a large portion of homogeneous background. As a comparison, the sAUC of the proposed model is higher and more stable as the image scale changes. The better scale tolerance could be due to the adopted ICH, which counts image pixel occurrence/co-occurrence and therefore has low sensitivity to image scale variation, as illustrated in Fig. 2a.

Fig. 7. (a) sAUC of the proposed model and four comparison models when the image scale changes from 0.1 to 1.0 of the original image scale (the neighborhood size is fixed at 2); (b) sAUC of the proposed model when the neighborhood size z changes from 0 to W/40 (W denotes the image width) and when the image is at 1, 0.5, and 0.25 of the original image scale, respectively. (Note that the y-axis scales of the graphs in (a) and (b) are different.)

Fig. 7b shows the sAUC of the proposed model when the neighborhood size z increases from W/120 to W/40 pixels (W denotes the image width) and when the images are scaled to 1.0, 0.5, and 0.25 of the original scale, respectively. As shown in Fig. 7b, the performance of the proposed model is stable even when z changes greatly (note the scale of the y-axis). In addition, a special case is tested by setting z to zero, in which case the ICH reduces to a 1D histogram. The results in Fig. 7b show that the sAUC then becomes lower, which clearly demonstrates the contribution of the local pixel pair co-occurrence captured by the ICH.

5.4 Discussion

The proposed saliency model can be used in different applications, such as object detection and segmentation, autonomous viewpoint control, advertisement design, and so on. Take object detection and segmentation as an example. Many objects can be detected and segmented by thresholding the ICH-based saliency maps, because they are often visually different from their surroundings and so have high saliency. This is illustrated in Fig. 8, where for the images (from the AIM dataset) in the first column, columns 2 and 3 show the corresponding ICH-based saliency maps and the segmented objects, respectively (a global threshold is set at four times the mean of the saliency map). As shown in Fig. 8, a number of objects with meaningful semantics are successfully detected and segmented based on the ICH-based saliency. Note that some object context is also segmented, due to the Gaussian smoothing.

Fig. 8. Object detection and segmentation using the histogram-based saliency: For the sample images in the first column, columns 2 and 3 show the corresponding histogram-based saliency maps and the detected and segmented objects, respectively.
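The thresholding step described above is essentially a one-liner; a minimal sketch, assuming a smoothed saliency map aligned with the image:

```python
import numpy as np

def segment_salient(image, sal):
    """Threshold the saliency map at 4x its mean (the global threshold used
    for Fig. 8) and keep only the pixels above it."""
    mask = sal > 4.0 * sal.mean()
    out = image.copy()
    out[~mask] = 0  # suppress non-salient pixels
    return mask, out
```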
The proposed model could be improved in several aspects. In particular, the optimal combination of saliency from different image channels needs to be studied. Currently, saliency from different image channels and domains is simply normalized and summed. For some images, however, saliency from certain channels predicts the human fixations well, whereas that from other channels has little correlation with the human fixations. A better saliency model could be derived through optimal weighting of saliency from different image channels and domains. In addition, the incorporation of objects with high-level semantics will be explored. Currently, the ICH-based model captures only certain low-level features, but objects with high-level semantics often predominantly attract our attention. Some systems [10], [11], [12] introduce a face detector into a generic saliency model to enhance prediction accuracy; incorporating detectors for other objects such as text, humans, and cars would be more useful for tasks such as target search.

6 CONCLUSION

This paper presents a saliency model that makes use of 2D co-occurrence histograms. Compared with state-of-the-art models, the proposed model has several advantageous characteristics: it is fast, with potential for real-time applications; it is tolerant to image scale variation; it involves minimal parameter tuning and is easy to implement; and it predicts human fixations accurately, obtaining a superior sAUC. Several issues will be further studied, including adaptive weighting of saliency from different image channels and the incorporation of top-down factors.

REFERENCES

[1] J. Tsotsos, "Analyzing Vision at the Complexity Level," Behavioral and Brain Sciences, vol. 13, no. 3.
[2] C.M. Moore and H. Egeth, "How Does Feature-Based Attention Affect Visual Processing?" J. Experimental Psychology: Human Perception and Performance, vol. 24, no. 4.
[3] A. Borji, D.N. Sihite, and L. Itti, "Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling: A Comparative Study," IEEE Trans. Image Processing, vol. 22, no. 1.
[4] L. Itti and C. Koch, "Computational Modeling of Visual Attention," Nature Rev. Neuroscience, vol. 2, no. 3.
[5] L. Itti, C. Koch, and E. Niebur, "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 11.
[6] T. Kadir and M. Brady, "Saliency, Scale and Image Description," Int'l J. Computer Vision, vol. 45, no. 2.
[7] D. Gao and N. Vasconcelos, "Bottom-Up Saliency Is a Discriminant Process," Proc. 11th IEEE Int'l Conf. Computer Vision.
[8] M.M. Cheng, G.X. Zhang, N. Mitra, X. Huang, and S.M. Hu, "Global Contrast Based Salient Region Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[9] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, "Saliency Filters: Contrast Based Filtering for Salient Region Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[10] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to Predict Where Humans Look," Proc. 12th IEEE Int'l Conf. Computer Vision.
[11] S. Goferman, L. Zelnik-Manor, and A. Tal, "Context-Aware Saliency Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[12] A. Borji, D.N. Sihite, and L. Itti, "Probabilistic Learning of Task-Specific Visual Attention," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[13] L. Wang, J. Xue, N. Zheng, and G. Hua, "Automatic Salient Object Extraction with Contextual Cue," Proc. IEEE Int'l Conf. Computer Vision.
[14] X. Hou and L. Zhang, "Saliency Detection: A Spectral Residual Approach," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[15] X. Hou, J. Harel, and C. Koch, "Image Signature: Highlighting Sparse Salient Regions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 1.
[16] T. Liu, J. Sun, N. Zheng, X. Tang, and H. Shum, "Learning to Detect a Salient Object," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[17] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H. Shum, "Learning to Detect a Salient Object," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 2.
[18] J. Feng, Y. Wei, L. Tao, C. Zhang, and J. Sun, "Salient Object Detection by Composition," Proc. IEEE Int'l Conf. Computer Vision.
[19] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, "Frequency-Tuned Salient Region Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[20] J. Harel, C. Koch, and P. Perona, "Graph-Based Visual Saliency," Proc. Advances in Neural Information Processing Systems.
[21] H.J. Seo and P. Milanfar, "Static and Space-Time Visual Saliency Detection by Self-Resemblance," J. Vision, vol. 12, no. 15.
[22] L. Marchesotti, C. Cifarelli, and G. Csurka, "A Framework for Visual Saliency Detection with Applications to Image Thumbnailing," Proc. 12th IEEE Int'l Conf. Computer Vision.
[23] L. Duan, C. Wu, J. Miao, L. Qing, and Y. Fu, "Visual Saliency Detection by Spatially Weighted Dissimilarity," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[24] G. Sharma, F. Jurie, and C. Schmid, "Discriminative Spatial Saliency for Image Classification," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[25] D. Chen and H. Chu, "Scale-Invariant Amplitude Spectrum Modulation for Visual Saliency Detection," IEEE Trans. Neural Networks and Learning Systems, vol. 23, no. 8.
[26] A. Garcia-Diaz, X.R. Fdez-Vidal, X.M. Pardo, and R. Dosil, "Saliency from Hierarchical Adaptation through Decorrelation and Variance Normalization," Image and Vision Computing, vol. 30.
[27] N. Bruce and J. Tsotsos, "Saliency Based on Information Maximization," Proc. Advances in Neural Information Processing Systems.
[28] N. Bruce and J. Tsotsos, "Saliency, Attention, and Visual Search: An Information Theoretic Approach," J. Vision, vol. 9, no. 3.
[29] N. Bruce, "Image Analysis through Local Information Measures," Proc. Int'l Conf. Pattern Recognition.
[30] L. Zhang, M.H. Tong, T.K. Marks, and G.W. Cottrell, "SUN: A Bayesian Framework for Saliency Using Natural Statistics," J. Vision, vol. 8, no. 7.
[31] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li, "Salient Object Detection for Searched Web Images via Global Saliency," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[32] W. Kienzle, F.A. Wichmann, B. Scholkopf, and M.O. Franz, "A Nonparametric Approach to Bottom-Up Visual Saliency," Proc. Advances in Neural Information Processing Systems.
[33] Q. Zhao and C. Koch, "Learning a Saliency Map Using Fixated Locations in Natural Scenes," J. Vision, vol. 3, no. 9.
[34] Q. Zhao and C. Koch, "Learning Visual Saliency by Combining Feature Maps in a Nonlinear Manner Using AdaBoost," J. Vision, vol. 12, no. 6.
[35] A. Borji and L. Itti, "Exploiting Local and Global Patch Rarities for Saliency Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[36] J. Huang, S.R. Kumar, M. Mitra, W.J. Zhu, and R. Zabih, "Image Indexing Using Color Correlograms," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[37] A. Rao, R.K. Srihari, and Z. Zhang, "Spatial Color Histograms for Content-Based Image Retrieval," Proc. 11th IEEE Int'l Conf. Tools with Artificial Intelligence.
[38] P. Chang and J. Krumm, "Object Recognition with Color Cooccurrence Histograms," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[39] S. Lu and J.H. Lim, "Saliency Modeling from Image Histograms," Proc. European Conf. Computer Vision.
[40] B. Tatler, R. Baddeley, and I. Gilchrist, "Visual Correlates of Fixation Selection: Effects of Scale and Time," Vision Research, vol. 45, no. 5.


A Quantitative Approach for Textural Image Segmentation with Median Filter International Journal of Advancements in Research & Technology, Volume 2, Issue 4, April-2013 1 179 A Quantitative Approach for Textural Image Segmentation with Median Filter Dr. D. Pugazhenthi 1, Priya

More information

A Hierarchical Visual Saliency Model for Character Detection in Natural Scenes

A Hierarchical Visual Saliency Model for Character Detection in Natural Scenes A Hierarchical Visual Saliency Model for Character Detection in Natural Scenes Renwu Gao 1, Faisal Shafait 2, Seiichi Uchida 3, and Yaokai Feng 3 1 Information Sciene and Electrical Engineering, Kyushu

More information

An ICA based Approach for Complex Color Scene Text Binarization

An ICA based Approach for Complex Color Scene Text Binarization An ICA based Approach for Complex Color Scene Text Binarization Siddharth Kherada IIIT-Hyderabad, India siddharth.kherada@research.iiit.ac.in Anoop M. Namboodiri IIIT-Hyderabad, India anoop@iiit.ac.in

More information

FASA: Fast, Accurate, and Size-Aware Salient Object Detection

FASA: Fast, Accurate, and Size-Aware Salient Object Detection FASA: Fast, Accurate, and Size-Aware Salient Object Detection Gökhan Yildirim, Sabine Süsstrunk School of Computer and Communication Sciences École Polytechnique Fédérale de Lausanne Abstract. Fast and

More information

Features Points. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE)

Features Points. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE) Features Points Andrea Torsello DAIS Università Ca Foscari via Torino 155, 30172 Mestre (VE) Finding Corners Edge detectors perform poorly at corners. Corners provide repeatable points for matching, so

More information

What do different evaluation metrics tell us about saliency models?

What do different evaluation metrics tell us about saliency models? 1 What do different evaluation metrics tell us about saliency models? Zoya Bylinskii*, Tilke Judd*, Aude Oliva, Antonio Torralba, and Frédo Durand arxiv:1604.03605v1 [cs.cv] 12 Apr 2016 Abstract How best

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

A reversible data hiding based on adaptive prediction technique and histogram shifting

A reversible data hiding based on adaptive prediction technique and histogram shifting A reversible data hiding based on adaptive prediction technique and histogram shifting Rui Liu, Rongrong Ni, Yao Zhao Institute of Information Science Beijing Jiaotong University E-mail: rrni@bjtu.edu.cn

More information

SALIENT OBJECT DETECTION VIA BACKGROUND CONTRAST

SALIENT OBJECT DETECTION VIA BACKGROUND CONTRAST SALIENT OBJECT DETECTION VIA BACKGROUND CONTRAST Quan Zhou,, Nianyi Li 2,, Jianxin Chen, Shu Cai, and Longin Jan Latecki 3 Key Lab of Ministry of Education for Broad Band Communication & Sensor Network

More information

NTHU Rain Removal Project

NTHU Rain Removal Project People NTHU Rain Removal Project Networked Video Lab, National Tsing Hua University, Hsinchu, Taiwan Li-Wei Kang, Institute of Information Science, Academia Sinica, Taipei, Taiwan Chia-Wen Lin *, Department

More information

Texture. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors

Texture. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors Texture The most fundamental question is: How can we measure texture, i.e., how can we quantitatively distinguish between different textures? Of course it is not enough to look at the intensity of individual

More information

Short Run length Descriptor for Image Retrieval

Short Run length Descriptor for Image Retrieval CHAPTER -6 Short Run length Descriptor for Image Retrieval 6.1 Introduction In the recent years, growth of multimedia information from various sources has increased many folds. This has created the demand

More information

Nonparametric Bottom-Up Saliency Detection by Self-Resemblance

Nonparametric Bottom-Up Saliency Detection by Self-Resemblance Nonparametric Bottom-Up Saliency Detection by Self-Resemblance Hae Jong Seo and Peyman Milanfar Electrical Engineering Department University of California, Santa Cruz 56 High Street, Santa Cruz, CA, 95064

More information

International Journal of Mechatronics, Electrical and Computer Technology

International Journal of Mechatronics, Electrical and Computer Technology An Efficient Importance Map for Content Aware Image Resizing Abstract Ahmad Absetan 1* and Mahdi Nooshyar 2 1 Faculty of Engineering, University of MohagheghArdabili, Ardabil, Iran 2 Faculty of Engineering,

More information

Asymmetry as a Measure of Visual Saliency

Asymmetry as a Measure of Visual Saliency Asymmetry as a Measure of Visual Saliency Ali Alsam, Puneet Sharma, and Anette Wrålsen Department of Informatics & e-learning (AITeL), Sør-Trøndelag University College (HiST), Trondheim, Norway er.puneetsharma@gmail.com

More information

Content based Image Retrieval Using Multichannel Feature Extraction Techniques

Content based Image Retrieval Using Multichannel Feature Extraction Techniques ISSN 2395-1621 Content based Image Retrieval Using Multichannel Feature Extraction Techniques #1 Pooja P. Patil1, #2 Prof. B.H. Thombare 1 patilpoojapandit@gmail.com #1 M.E. Student, Computer Engineering

More information

Schedule for Rest of Semester

Schedule for Rest of Semester Schedule for Rest of Semester Date Lecture Topic 11/20 24 Texture 11/27 25 Review of Statistics & Linear Algebra, Eigenvectors 11/29 26 Eigenvector expansions, Pattern Recognition 12/4 27 Cameras & calibration

More information

IMPLEMENTATION OF THE CONTRAST ENHANCEMENT AND WEIGHTED GUIDED IMAGE FILTERING ALGORITHM FOR EDGE PRESERVATION FOR BETTER PERCEPTION

IMPLEMENTATION OF THE CONTRAST ENHANCEMENT AND WEIGHTED GUIDED IMAGE FILTERING ALGORITHM FOR EDGE PRESERVATION FOR BETTER PERCEPTION IMPLEMENTATION OF THE CONTRAST ENHANCEMENT AND WEIGHTED GUIDED IMAGE FILTERING ALGORITHM FOR EDGE PRESERVATION FOR BETTER PERCEPTION Chiruvella Suresh Assistant professor, Department of Electronics & Communication

More information

Texture. Texture is a description of the spatial arrangement of color or intensities in an image or a selected region of an image.

Texture. Texture is a description of the spatial arrangement of color or intensities in an image or a selected region of an image. Texture Texture is a description of the spatial arrangement of color or intensities in an image or a selected region of an image. Structural approach: a set of texels in some regular or repeated pattern

More information

Image Saliency: From Intrinsic to Extrinsic Context

Image Saliency: From Intrinsic to Extrinsic Context Image Saliency: From Intrinsic to Extrinsic Context Meng Wang, Janusz Konrad, Prakash Ishwar Dept. of Electrical and Computer Eng., Boston University Boston, MA 02215 {wangmeng,jkonrad,pi}@bu.edu Kevin

More information

How to Evaluate Foreground Maps?

How to Evaluate Foreground Maps? How to Evaluate Foreground Maps? Ran Margolin Technion Haifa, Israel margolin@tx.technion.ac.il Lihi Zelnik-Manor Technion Haifa, Israel lihi@ee.technion.ac.il Ayellet Tal Technion Haifa, Israel ayellet@ee.technion.ac.il

More information

Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach

Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach Vandit Gajjar gajjar.vandit.381@ldce.ac.in Ayesha Gurnani gurnani.ayesha.52@ldce.ac.in Yash Khandhediya khandhediya.yash.364@ldce.ac.in

More information

Learning video saliency from human gaze using candidate selection

Learning video saliency from human gaze using candidate selection Learning video saliency from human gaze using candidate selection Rudoy, Goldman, Shechtman, Zelnik-Manor CVPR 2013 Paper presentation by Ashish Bora Outline What is saliency? Image vs video Candidates

More information

Classification of Protein Crystallization Imagery

Classification of Protein Crystallization Imagery Classification of Protein Crystallization Imagery Xiaoqing Zhu, Shaohua Sun, Samuel Cheng Stanford University Marshall Bern Palo Alto Research Center September 2004, EMBC 04 Outline Background X-ray crystallography

More information

Multi-Scale Kernel Operators for Reflection and Rotation Symmetry: Further Achievements

Multi-Scale Kernel Operators for Reflection and Rotation Symmetry: Further Achievements 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops Multi-Scale Kernel Operators for Reflection and Rotation Symmetry: Further Achievements Shripad Kondra Mando Softtech India Gurgaon

More information

Graph Matching Iris Image Blocks with Local Binary Pattern

Graph Matching Iris Image Blocks with Local Binary Pattern Graph Matching Iris Image Blocs with Local Binary Pattern Zhenan Sun, Tieniu Tan, and Xianchao Qiu Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of

More information

A new predictive image compression scheme using histogram analysis and pattern matching

A new predictive image compression scheme using histogram analysis and pattern matching University of Wollongong Research Online University of Wollongong in Dubai - Papers University of Wollongong in Dubai 00 A new predictive image compression scheme using histogram analysis and pattern matching

More information

CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT

CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT 2.1 BRIEF OUTLINE The classification of digital imagery is to extract useful thematic information which is one

More information

Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention

Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention Anand K. Hase, Baisa L. Gunjal Abstract In the real world applications such as landmark search, copy protection, fake image

More information

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE K. Kaviya Selvi 1 and R. S. Sabeenian 2 1 Department of Electronics and Communication Engineering, Communication Systems, Sona College

More information

Pattern Recognition Letters

Pattern Recognition Letters Pattern Recognition Letters 34 (2013) 34 41 Contents lists available at SciVerse ScienceDirect Pattern Recognition Letters journal homepage: www.elsevier.com/locate/patrec Multi-spectral saliency detection

More information

The Vehicle Logo Location System based on saliency model

The Vehicle Logo Location System based on saliency model ISSN 746-7659, England, UK Journal of Information and Computing Science Vol. 0, No. 3, 205, pp. 73-77 The Vehicle Logo Location System based on saliency model Shangbing Gao,2, Liangliang Wang, Hongyang

More information

TEXTURE CLASSIFICATION METHODS: A REVIEW

TEXTURE CLASSIFICATION METHODS: A REVIEW TEXTURE CLASSIFICATION METHODS: A REVIEW Ms. Sonal B. Bhandare Prof. Dr. S. M. Kamalapur M.E. Student Associate Professor Deparment of Computer Engineering, Deparment of Computer Engineering, K. K. Wagh

More information

An Adaptive Threshold LBP Algorithm for Face Recognition

An Adaptive Threshold LBP Algorithm for Face Recognition An Adaptive Threshold LBP Algorithm for Face Recognition Xiaoping Jiang 1, Chuyu Guo 1,*, Hua Zhang 1, and Chenghua Li 1 1 College of Electronics and Information Engineering, Hubei Key Laboratory of Intelligent

More information

Saliency Estimation Using a Non-Parametric Low-Level Vision Model

Saliency Estimation Using a Non-Parametric Low-Level Vision Model Accepted for IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado, June 211 Saliency Estimation Using a Non-Parametric Low-Level Vision Model Naila Murray, Maria Vanrell, Xavier

More information

A Parametric Spectral Model for Texture-Based Salience

A Parametric Spectral Model for Texture-Based Salience A Parametric Spectral Model for Texture-Based Salience Kasim Terzić, Sai Krishna and J.M.H. du Buf {kterzic,dubuf}@ualg.pt Vision Laboratory/LARSys, University of the Algarve Abstract. We present a novel

More information

Image Compression: An Artificial Neural Network Approach

Image Compression: An Artificial Neural Network Approach Image Compression: An Artificial Neural Network Approach Anjana B 1, Mrs Shreeja R 2 1 Department of Computer Science and Engineering, Calicut University, Kuttippuram 2 Department of Computer Science and

More information

Elimination of Duplicate Videos in Video Sharing Sites

Elimination of Duplicate Videos in Video Sharing Sites Elimination of Duplicate Videos in Video Sharing Sites Narendra Kumar S, Murugan S, Krishnaveni R Abstract - In some social video networking sites such as YouTube, there exists large numbers of duplicate

More information

Patch-Based Color Image Denoising using efficient Pixel-Wise Weighting Techniques

Patch-Based Color Image Denoising using efficient Pixel-Wise Weighting Techniques Patch-Based Color Image Denoising using efficient Pixel-Wise Weighting Techniques Syed Gilani Pasha Assistant Professor, Dept. of ECE, School of Engineering, Central University of Karnataka, Gulbarga,

More information

FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE. Project Plan

FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE. Project Plan FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE Project Plan Structured Object Recognition for Content Based Image Retrieval Supervisors: Dr. Antonio Robles Kelly Dr. Jun

More information

Express Letters. A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation. Jianhua Lu and Ming L. Liou

Express Letters. A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation. Jianhua Lu and Ming L. Liou IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 2, APRIL 1997 429 Express Letters A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation Jianhua Lu and

More information