Dynamic saliency models and human attention: a comparative study on videos


Nicolas Riche, Matei Mancas, Dubravko Culibrk, Vladimir Crnojevic, Bernard Gosselin, Thierry Dutoit
University of Mons, Belgium; University of Novi Sad, Serbia

Abstract. Significant progress has been made on computational models of bottom-up visual attention (saliency). However, efficient ways of comparing these models for still images remain an open research question. The problem is even more challenging when dealing with videos and dynamic saliency. This paper proposes a framework for dynamic-saliency model evaluation, based on a new database of diverse videos for which eye-tracking data has been collected. In addition, we present evaluation results obtained for 4 state-of-the-art dynamic-saliency models, two of which have not been verified on eye-tracking data before.

1 Introduction

When faced with visual stimuli, the human visual system (HVS) does not process the whole scene in parallel. Part of the visual information sensed by the eyes is discarded in a systematic manner in order to attend to objects of interest [1][2]. The ability to orient rapidly towards salient objects in a cluttered visual scene allows an organism to quickly detect possible prey, mates or predators in the visual world, making it a clear evolutionary advantage [1].

Current research considers attentional deployment a two-component mechanism [1]. Subjects selectively direct attention to objects in a scene using both bottom-up, image-based saliency cues and top-down, task-dependent cues; an idea that dates back to the 19th-century work of William James [3]. Bottom-up processing is driven by the stimulus presented [4]. Some stimuli are conspicuous, or salient, in a given context. This suggests that saliency is computed in a pre-attentive and involuntary manner across the entire visual field, most probably in terms of hierarchical centre-surround mechanisms. The same holds for moving stimuli: they are perceived as moving only if they undergo motion different from that of their wider surround [2].

While the bottom-up factors that influence attention are well understood [5], the integration of top-down knowledge into these models remains an open, task-specific problem. Because of this, and because bottom-up saliency can be seen as a generic approximation of attention in the very first viewing moments, applications of visual attention commonly rely on bottom-up models [6][7][8][9][10].
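As a toy illustration of the centre-surround principle mentioned above (this sketch is not part of the original study, and the scale parameters are arbitrary), a location can be scored by how much its local response differs from a wider surround, e.g. via a Difference-of-Gaussians:

```python
# Minimal centre-surround saliency sketch: a Difference-of-Gaussians responds
# strongly where a location differs from its wider surround.
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_saliency(intensity, sigma_center=2.0, sigma_surround=8.0):
    """intensity: 2D grayscale array; sigmas are illustrative scales."""
    center = gaussian_filter(intensity.astype(float), sigma_center)
    surround = gaussian_filter(intensity.astype(float), sigma_surround)
    return np.abs(center - surround)   # high where local != surround
```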

Significant progress has been made on computational models of bottom-up visual saliency for still images [11][12][13][14]. Motion cues, despite their importance, have until recently usually been considered an extension of static saliency. Dynamic saliency models, operating on videos, are gaining interest [15][10][16]. Effective ways of comparing the performance of such models are still lacking, as video-based saliency raises several issues in comparative studies that do not arise for still images. Moreover, while eye-tracking databases for still images are becoming available for model assessment, video databases are still scarce.

In this study, we propose a framework for dynamic-saliency model comparison, based on ground-truth eye-tracking data collected from humans viewing a new, freely available video database [17]. We evaluate 4 state-of-the-art saliency models and discuss their performance on the proposed database. Two of the evaluated models have not previously been verified using eye-tracking data, and none has been evaluated on such an extensive set of videos.

The rest of the paper is organized as follows: Section 2 provides an overview of related published work. Section 3 is dedicated to the description of the proposed methodology. The results can be found in Section 4. Finally, Section 5 contains the discussion and our conclusions.

2 Related Work

Zhang et al. [14] proposed a Bayesian framework for saliency detection (SUN model). Their approach is based on the assumption that visual saliency is the probability of a target (e.g., food or predators) at every location, given the visual features observed. Bottom-up saliency is considered equal to what is known in information theory as the self-information of the random variable representing the visual features at a point. Saliency therefore depends solely on the probability distribution of feature values. Zhang et al. estimate this probability by computing feature response maps on a set of 138 images of natural scenes. The distribution for each feature is parametrized by fitting a zero-mean generalized Gaussian. The authors did not propose their own features, but chose to test their model using biologically plausible Difference-of-Gaussian filters, akin to those used in the seminal paper of Itti et al. [18]. In addition, features proposed by Bruce and Tsotsos [12] were considered; these are linear features derived from a set of natural images using Independent Component Analysis (ICA). In their study, Zhang et al. concluded that the performance of their approach is much better when the independent ICA features are used instead of the dependent DoG features. These features therefore form the foundation of the algorithm tested in this study.

Seo and Milanfar [13] proposed a framework for static and space-time saliency detection (SEO model). Their saliency model relies solely on centre-surround differences between a local window centred on a pixel location and its neighbouring windows. To represent image structure in the windows, they proposed the use of Local Steering Kernels (LSK). These features are canonical kernels fitted to pixel value differences based on estimated gradients. To take motion into account and achieve space-time (dynamic) saliency detection, Seo and Milanfar extend the pixel location with a time coordinate: the central window and the surrounding region become spatio-temporal cubes.
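The self-information idea behind SUN (saliency as -log p of the observed features; the same rarity principle reappears in the Mancas model below) can be sketched as follows. This is a hedged illustration only: it fits a simple histogram density rather than the generalized Gaussian used by Zhang et al., and generic feature responses stand in for their ICA features.

```python
# SUN-style self-information sketch: rare feature responses are salient.
import numpy as np

def self_information_saliency(feature_map, train_responses, n_bins=64):
    """feature_map: responses on the test image; train_responses: responses
    pooled over natural scenes, used to estimate p(F)."""
    hist, edges = np.histogram(train_responses, bins=n_bins, density=True)
    hist = np.clip(hist, 1e-8, None)                     # avoid log(0)
    idx = np.clip(np.digitize(feature_map, edges) - 1, 0, n_bins - 1)
    return -np.log(hist[idx])                            # saliency = -log p(F)
```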

Similar to the approach of Zhang et al., they estimate the conditional probability of a point being salient, given the features extracted for both the central window and the surrounding region. However, rather than using a parametric estimate, Seo and Milanfar estimate this conditional probability using Kernel Density Estimation (KDE). The saliency is then computed as the centre value of a normalized adaptive kernel computed over the centre-plus-surround region.

In [10], Culibrk et al. proposed the use of a multi-scale background modelling and foreground segmentation approach as an efficient saliency model (Culibrk model), driven by both motion and simple static cues, which adheres to the motion-saliency principles reported in [2]. The model employs the principles of multi-scale processing, cross-scale motion consistency, outlier detection and temporal coherence. The algorithm maintains a multi-scale model of the background in the form of a Gaussian pyramid. The background frames at each level are obtained by infinite-impulse-response (running-average) filtering, commonly used in background subtraction [19][20]. This allows the approach to take temporal consistency between frames into account and to reduce the saliency of static objects as time passes. Rather than attempting to learn the distribution of the features, outlier detection [21] is used to detect salient changes in the frame. The assumption is that the salient changes are those that differ significantly from the changes undergone by most of the pixels in the frame.

The Mancas model, proposed in [15], uses only dynamic features such as speed and direction. No static cues about colours, grey levels or orientations are included in the model. The algorithm is based on three main steps: motion feature extraction, spatio-temporal filtering, and rare motion extraction. The model is intended to quantify rare and abnormal motion. As a first step, features are extracted using Farneback's algorithm for optical flow [22]. The features are then discretized into 4 directions (north, south, west, east) and 5 speeds (very slow, slow, mean, fast, very fast). Next, a spatio-temporal low-pass filter is applied to the discretized feature channels; its purpose is to keep only consistent spatio-temporal events and provide a spatio-temporal description of each feature. After the filtering of each of the 9 feature channels (4 directions, 5 speeds), a histogram with 5 bins is computed for each resulting image and the self-information of the pixels in a given bin is computed. Self-information provides a pixel saliency index, and the different feature channels are fused using the maximum operator. The final saliency map specifies the rarity of the statistics of a given video volume at several scales.
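A minimal single-scale, single-frame-pair sketch of this pipeline is given below, assuming OpenCV's Farneback implementation. It is not the authors' code: the spatio-temporal filtering stage is omitted, and the quantile-based speed bins are our assumption for the five speed categories.

```python
# Motion-rarity sketch in the spirit of the Mancas model:
# optical flow -> direction/speed channels -> per-channel self-information
# -> max fusion. Spatio-temporal filtering and multi-scale processing omitted.
import cv2
import numpy as np

def motion_rarity_map(prev_gray, gray, n_bins=5):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Four direction channels (quadrants of the flow angle), weighted by speed.
    channels = [np.where((ang >= a) & (ang < a + np.pi / 2), mag, 0.0)
                for a in np.arange(4) * (np.pi / 2)]
    # Five speed channels from flow-magnitude quantiles (very slow .. very fast).
    q = np.percentile(mag, [0, 20, 40, 60, 80, 100])
    q[-1] += 1e-6                                 # make the top bin inclusive
    channels += [((mag >= lo) & (mag < hi)).astype(float)
                 for lo, hi in zip(q[:-1], q[1:])]
    saliency = np.zeros_like(mag, dtype=float)
    for ch in channels:
        hist, edges = np.histogram(ch, bins=n_bins)
        p = np.clip(hist / float(ch.size), 1e-8, None)
        idx = np.clip(np.digitize(ch, edges) - 1, 0, n_bins - 1)
        saliency = np.maximum(saliency, -np.log(p[idx]))   # max-operator fusion
    return saliency
```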

3 Human attention versus dynamic saliency models

3.1 ASCMN (Abnormal Surveillance Crowd Moving Noise) dataset

The comparison of automatic saliency models first requires ground-truth data. For overt attention and the detection of eye fixation positions, the ground truth of human attentional behaviour can be measured with an eye-tracking device. Eye-tracking devices let us acquire, for several users, the precise gaze position at a video frame rate (30 frames per second).

There are very few databases containing both video and eye-tracking data. The best known is Itti's CRCNS-ORIG video database [23], which contains 50 videos along with eye-tracking data for 8 viewers. The DIEM dataset [24] contains 85 videos and eye data collected from 250 participants, who did not, however, necessarily view all the videos. In [25], an eye-tracking database with 12 standard videos used in image compression and quality estimation, viewed by 15 people, was proposed. Finally, the Lübeck dataset [26] contains, in addition to static images, 18 videos viewed by 54 subjects. All these databases contain some videos with different kinds of motion in the scene (the sequences are extracted mostly from TV programs, Hollywood movies, standard video databases or video games). However, none of them was designed specifically to contain anomalous motion which would attract attention in the presence of other motion and thus enable the testing of dynamic-saliency models.

The proposed ASCMN database [17] attempts to fill this gap. It contains videos obtained from Itti's CRCNS database, Vasconcelos's database [27] and a standard complex-background video surveillance database [19]. These have been extended with Internet and proprietary crowd movies. The database is divided into 5 classes of movies, as described in Table 1. In addition to videos for which eye-tracking data has previously been published in existing databases, the classes cover a type of video lacking in the other databases: videos that contain motion abnormalities and crowd motion. Eye-tracking data for the complex-background surveillance videos included in ASCMN has also not previously been published. Sample frames for the different classes of videos are shown in Figure 1. The ASCMN database therefore provides data covering a wider spectrum of video types than the existing databases, and accumulates previously published videos suitable for dynamic-saliency model evaluation. ASCMN contains 24 videos, together with eye-tracking data from 13 viewers, acquired using a commercial FaceLab eye-tracking system [28].

Table 1. The 5 classes of videos contained in the ASCMN database

1) ABNORMAL (abnormal motion): some moving blobs have a different speed or direction compared to the main stream (Figure 1, line 1). Videos 2, 4, 16, 18, 20.
2) SURVEILLANCE (video surveillance style): classical surveillance camera with no special motion event (Figure 1, line 2). Videos 1, 3, 5, 9.
3) CROWD (crowd motion): motion of more or less dense crowds (Figure 1, line 3). Videos 8, 10, 12, 14, 21.
4) MOVING (moving camera): videos taken with a moving camera (Figure 1, line 4). Videos 6, 19, 22, 24.
5) NOISE (motion noise with sudden salient motion): no motion during several seconds, followed by sudden important motion (Figure 1, line 5). Videos 7, 11, 13, 15, 17, 23.

Fig. 1. The 5 classes of videos from the ASCMN database. The first line represents ABNORMAL motion, with bikes and cars faster than people, for example. The second line shows classical SURVEILLANCE motion with nothing really salient in terms of movement. The third line shows CROWD motion with density increasing from left to right. Line four shows MOVING-camera videos. Line five displays videos with long periods of NOISE (frames 2, 4) and the sudden appearance of a salient object (frames 1, 3).

The FaceLab system allows small head movements and is thus less intrusive than other eye-tracking systems, making the viewer feel more comfortable. The viewers were PhD students and researchers ranging from 23 to 35 years old, both male and female. The eye-gaze positions are recorded and superimposed on the initial video for all the viewers, as shown in the first column of Figure 2. We then low-pass filter the subjects' gaze positions to obtain a heatmap, which can also be superimposed on the corresponding video frame (Figure 2, second column). This post-processing step is useful for estimating the mean gaze density, eliminating outliers and giving higher weight to focus points common to several users. Finally, depending on the metric used to assess the correspondence between the eye-tracking results and an automatic saliency model, a thresholded version of the heatmap can be obtained (Figure 2, third column).
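A minimal sketch of this post-processing follows. It is an illustration under our own assumptions: the Gaussian width and the threshold value are illustrative, not the values used in the study.

```python
# Gaze heatmap sketch: accumulate fixations, low-pass filter, threshold.
import numpy as np
from scipy.ndimage import gaussian_filter

def gaze_heatmap(fixations, shape, sigma=30.0, thresh=0.7):
    """fixations: iterable of (row, col) gaze positions over all viewers;
    shape: (height, width) of the video frame."""
    fix_map = np.zeros(shape, dtype=float)
    for r, c in fixations:
        fix_map[int(r), int(c)] += 1.0      # shared focus points get more weight
    heat = gaussian_filter(fix_map, sigma)  # low-pass filtering -> heatmap
    heat /= heat.max() + 1e-12              # normalize to [0, 1]
    return heat, (heat >= thresh)           # heatmap and thresholded binary mask
```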

Fig. 2. Left column: aggregated eye-tracking results; each red dot is the gaze position of one viewer. The middle column shows the smoothed gaze locations, producing heatmaps superimposed on the corresponding frame. The right column shows a thresholded version of the heatmap.

3.2 Comparison metrics for dynamic saliency models

Several similarity measures can be used to compare saliency models with eye-tracking data. We choose three of them, due to their complementary nature: the Normalized Scanpath Saliency (NSS) [29], the Correlation Coefficient (CC) and the area under the Receiver Operating Characteristic curve (AUC-ROC) [30]. These measures were designed to use fixations, but this approach is not optimal with video data, because the subjects' gaze does not correspond precisely to objects in motion (due to lag or anticipation; see Section 4.1) and there are few fixations per frame (several per second). Thus, we propose using fixation heatmaps instead of the fixations themselves to conduct the comparison. The goal of choosing more than one comparison metric is to ensure that the discussion of the results is as independent as possible from the choice of metric. Different metrics do not necessarily give the same result, but when two metrics agree, the interpretation is more robust.

For the initial comparison, we chose the NSS metric, which is the average of the response values at human eye positions in a model's saliency map. To compute it, the saliency map is normalized to have zero mean and unit standard deviation. The eye-tracking map can then be thresholded to become a binary mask, as shown in the third column of Figure 2. For the second metric, CC, the same procedure was used. To calculate both metrics, we used the freely available Matlab functions implemented by Ali Borji [31]. For NSS and CC, a selective threshold of 0.7 was chosen for the eye heatmap. It is set empirically, low enough to provide reasonably large regions but high enough to avoid potentially uninteresting regions: these two measures mainly validate the presence of high salience in the areas of high eye-gaze density.

The final evaluation is made using AUC-ROC, the area under the ROC curve. For this metric, we normalize the saliency map and use the density eye-tracking map (the heatmap in Figure 2, second column) directly. The heatmap is then thresholded to create a binary mask that separates the positive samples from the negative. Contrary to the NSS and CC measures, this threshold is much less selective and merely intends to eliminate the areas containing no fixations. Indeed, the ROC measure mainly focuses on the order and position of fixations, not on the fixation density. By thresholding the saliency map at several different thresholds and plotting the True Positive Rate (TPR) versus the False Positive Rate (FPR), a ROC curve is obtained. Once the ROC curve is computed, the Area Under the Curve (AUC) can be calculated. To compute this metric, we used the freely available Matlab functions implemented by Brian Lau [32].
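The three metrics can be sketched compactly as below. This is not the Matlab code of Borji [31] or Lau [32] used in the study; scikit-learn's ROC routine stands in for the threshold sweep, and the weak fixation threshold is an illustrative value.

```python
# Sketch of the three comparison metrics on a saliency map vs. a gaze heatmap.
import numpy as np
from sklearn.metrics import roc_auc_score

def nss(sal, fix_mask):
    """Mean normalized saliency at gazed locations (fix_mask: boolean array)."""
    s = (sal - sal.mean()) / (sal.std() + 1e-12)   # zero mean, unit std
    return s[fix_mask].mean()

def cc(sal, heat):
    """Pearson correlation between saliency map and gaze heatmap."""
    return np.corrcoef(sal.ravel(), heat.ravel())[0, 1]

def auc_roc(sal, heat, fix_thresh=0.1):
    """Weakly threshold the heatmap into labels, sweep the saliency map."""
    labels = heat.ravel() >= fix_thresh            # keep only fixated areas
    return roc_auc_score(labels, sal.ravel())      # TPR-vs-FPR threshold sweep
```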

4 Results

4.1 Comparison issues and data processing

The comparison of static saliency models on still images is already a difficult task. The data obtained from eye tracking and the saliency maps are of a different nature, and the saliency maps themselves can differ greatly (in spatial resolution, for example). In addition, parameters that control specific processes within the saliency estimation algorithms, such as smoothing, can significantly influence the quantitative results.

For dynamic data (videos) and dynamic saliency maps, the task is even more challenging than for still images. Problems due to object motion and eye tracking compound the common comparison issues. The main issue is that rapid motion is not followed precisely by people's gaze: the gaze position (fixation) sometimes lags behind the real object the subjects are tracking (in the case of high-speed objects), or moves ahead of the object when people try to anticipate its trajectory. This means that the maximum gaze density is not necessarily focused on the tracked object, but may lie somewhere in its surroundings. Saliency maps with high spatial resolution, which are very well focused on the tracked object, will thus be penalized. To avoid the effects of lag or anticipation, and to allow a fair relative comparison of the different dynamic saliency models, we propose a novel saliency-map preprocessing methodology.

Fig. 3. Preprocessing of the data to provide a fair comparison between the models.

The initial saliency maps of the four dynamic models can be seen in the left column of Figure 3. When the saliency maps are thresholded, as is done for the NSS and CC metrics, one can observe (second column of Figure 3) that the sizes of the detected objects are very different. We therefore applied mathematical-morphology operations and smoothing to the four saliency maps so that they all exhibit the same object size. The result is visible in the third column of Figure 3, while the fourth column shows the eye-tracking data after thresholding. At this point, the moving-object sizes are identical for all four saliency models. The morphological operations and smoothing parameters were chosen to maximize each model's results in terms of NSS, CC and AUC-ROC. These parameters ensure both the optimal evaluation of each model and fairness in their relative comparison. In addition, since substantial shifts between the tracked target and the real gaze position are observed, using the precise fixation position is much less meaningful than for still images. We therefore dilated the fixation points to larger disks before producing the heatmaps.

Due to the heterogeneous features used by the different saliency algorithms, an additional problem for the comparison of saliency maps and video eye scanpaths became apparent. Algorithms which do not use static cues will perform worse than the others on frames with no visible motion. The mean score for an entire video will therefore be influenced by the number of quasi-static frames within it.
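A minimal sketch of this preprocessing is given below. In the study the morphology and smoothing parameters are tuned per model to maximize its scores; the fixed kernel sizes here are illustrative assumptions only.

```python
# Preprocessing sketch: equalize detected-blob sizes across models, and grow
# fixation points into disks to absorb gaze lag/anticipation.
import cv2

def equalize_blob_size(sal, open_px=3, close_px=15, sigma=9.0):
    """sal: float32 saliency map. Opening drops speckle, closing merges
    fragments, Gaussian smoothing equalizes blob extent across models."""
    k_open = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (open_px, open_px))
    k_close = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (close_px, close_px))
    s = cv2.morphologyEx(sal, cv2.MORPH_OPEN, k_open)
    s = cv2.morphologyEx(s, cv2.MORPH_CLOSE, k_close)
    return cv2.GaussianBlur(s, (0, 0), sigma)

def dilate_fixations(fix_map, radius=20):
    """Dilate each fixation point into a disk before heatmap computation."""
    k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                  (2 * radius + 1, 2 * radius + 1))
    return cv2.dilate(fix_map, k)
```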

4.2 Comparison results

Here, we compare the four models presented in Section 2 on the eye-tracking data collected for the ASCMN database, using the comparison metrics discussed in Section 3 and the preprocessing described in the previous section. In addition to the four models, a comparison with a simple frame-to-frame differencing algorithm was performed. This algorithm uses the absolute difference between two successive frames to detect moving objects. We will call this approach BASELINE in the following sections. The baseline is used to compare the saliency models against a very simple motion detection algorithm, in order to evaluate the advantage of the saliency models over motion detection alone.
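The BASELINE is fully specified by the description above; a direct sketch follows (the min-max normalization is our choice, added so the output is comparable to the normalized saliency maps):

```python
# BASELINE sketch: absolute difference of two successive frames as saliency.
import cv2

def baseline_saliency(prev_gray, gray):
    diff = cv2.absdiff(gray, prev_gray)                 # frame-to-frame change
    return cv2.normalize(diff.astype('float32'), None,  # rescale to [0, 1]
                         0.0, 1.0, cv2.NORM_MINMAX)
```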

The whole set of results is summarized in Figure 5 and in Tables 2 and 3. The first row of Figure 5 shows the results obtained for the CC metric, the second row for NSS and the third row for AUC-ROC, for each video class taken separately and for all classes together.

For ABNORMAL videos, the Mancas model provides the best results, followed by the Culibrk model. While the baseline scores well on CC and NSS, it drops below all the other models on the AUC-ROC metric. For SURVEILLANCE videos, the Culibrk model works best according to the CC and NSS metrics, while the Mancas model is best according to AUC-ROC. All the models do better than the baseline. For the CROWD videos, the Culibrk model is best according to the AUC-ROC metric, while the Mancas model is best according to the CC and NSS metrics. The baseline is very efficient here, as the main cue is motion itself. For the MOVING videos, the model of Seo is best according to the AUC-ROC metric, while the Mancas model is best according to the CC and NSS metrics. For NOISE videos, the Mancas model performs best on AUC-ROC, while it is outperformed by the Culibrk model on CC and by the Culibrk and SUN models on NSS.

Finally, the results for the whole dataset are aggregated, providing the mean and standard error (SE) [33]. The Mancas model is best, closely followed by the Culibrk model. It is interesting to note that the Seo model does not work very well here, as it is outperformed by the baseline on two of the three metrics.

5 Discussion and conclusion

We proposed a framework for evaluating dynamic-saliency models, based on a novel video database, which is publicly available [17]. The database contains 5 classes of videos covering a wide spectrum of realistic situations, and is designed specifically for evaluating dynamic-saliency models. To the best of our knowledge, the proposed database has the largest coverage of such real-life videos. Four dynamic saliency models, integrating both static and dynamic features, or only dynamic features in the case of the Mancas model, were tested on this database using three different metrics. They were also compared with a basic motion detection model, which was considered the baseline. The comparison was performed both separately, for each of the 5 video classes, and globally for the whole database. From these comparisons, several conclusions can be drawn.

Fig. 4. The left images show high fixation dispersion; the other two show low dispersion.

First, when compared to eye-tracking data, the saliency models are better than the baseline, which only detects motion. There is an exception to this general rule for the CROWD video class, where the baseline provides good results. In crowds, it seems that motion is of huge importance. As the people are small, it is difficult for humans to spot the difference between kinds of motion, and areas within the crowd with high motion will naturally attract attention. There are probably too many distractors for the eye to detect small groups of people behaving differently.

The NOISE videos proved difficult for all the models. This is due to the fact that, in a significant part of each video, nothing special happens and only motion noise is present, as shown in the two left frames of Figure 4. In this case it is impossible to use a saliency prediction model: the dispersion of human fixations is too high, with no common tendency.

As expected, the saliency models perform best on the ABNORMAL videos, and two of the four models achieve high selectivity in detecting the motion of interest, namely the Seo and Mancas models. Seo is the most selective, which is very good when the selected blob coincides with the one attracting people's gaze, but highly penalizing otherwise. As we use the mean score over the video, Seo's score drops dramatically each time the gaze target is missed, which yields a mediocre overall score for the whole video. The Mancas algorithm is also selective, but less so than Seo's, so it is not penalized as severely when it misses the target, since the other moving blobs are not completely ignored. The interesting conclusion is that a saliency model for video should of course be selective, but not too selective, in order to obtain good mean results.

Finally, we identified many issues in achieving a fair comparison, as the eye gaze is not necessarily always superimposed on the target. Before the preprocessing step described in Section 4.1, the Culibrk model was highly penalized by its overly high spatial resolution. After the preprocessing step, the nature of the saliency maps began to converge, which made the comparison between the maps easier. We used several metrics, and well-known implementations of those metrics, to obtain complementary scores and avoid relying on any single metric.

Four models of bottom-up dynamic saliency detection were compared on a database containing a wide variety of video classes. None of them stands out in all situations; they actually seem complementary depending on the video class, even if the Mancas and Culibrk models have the higher overall scores. Seo is also interesting as it is highly selective, while SUN's performance is below the others, probably due to its disregard for motion cues.

Table 2. Results for the 5 classes of videos and the 3 comparison metrics

1) ABNORMAL (abnormal motion)
   AUC-ROC mean: Mancas 74%, Culibrk 73%, SEO 71%, SUN 68%, Baseline 65%
   CC mean:      Mancas 0.28, Culibrk 0.25, SUN 0.24, SEO 0.23, Baseline 0.23
   NSS mean:     Mancas 1.63, Culibrk 1.43, SUN 1.40, SEO 1.39, Baseline 1.35

2) SURVEILLANCE (video surveillance style)
   AUC-ROC mean: Mancas 72%, SEO 70%, SUN 68%, Culibrk 65%, Baseline 63%
   CC mean:      Culibrk 0.27, SUN 0.25, SEO 0.20, Mancas 0.19, Baseline 0.16
   NSS mean:     Culibrk 1.68, SUN 1.51, SEO 1.23, Mancas 1.18, Baseline 1.02

3) CROWD (crowd motion)
   AUC-ROC mean: Culibrk 68%, Baseline 65%, Mancas 65%, SEO 61%, SUN 60%
   CC mean:      Mancas 0.18, Baseline 0.15, Culibrk 0.13, SUN 0.10, SEO 0.07
   NSS mean:     Mancas 0.95, Baseline 0.82, Culibrk 0.73, SUN 0.57, SEO 0.38

4) MOVING (moving camera)
   AUC-ROC mean: SEO 63%, Culibrk 63%, Mancas 60%, SUN 59%, Baseline 56%
   CC mean:      Mancas 0.17, SEO 0.14, SUN 0.13, Culibrk 0.10, Baseline 0.08
   NSS mean:     Mancas 0.99, SEO 0.81, SUN 0.74, Culibrk 0.58, Baseline

5) NOISE (motion noise with sudden salient motion)
   AUC-ROC mean: Mancas 67%, Culibrk 65%, SUN 62%, Baseline 61%, SEO 60%
   CC mean:      Culibrk 0.19, Mancas 0.17, SUN 0.16, Baseline 0.14, SEO 0.11
   NSS mean:     Culibrk 1.07, SUN 0.97, Mancas 0.95, Baseline 0.77, SEO 0.67

Table 3. Results for the whole dataset of videos and the 3 comparison metrics

Models     AUC-ROC mean (+/- SE)   CC mean (+/- SE)   NSS mean (+/- SE)
Mancas     (+/- )                  (+/- )             (+/- )
Culibrk    (+/- )                  (+/- )             (+/- )
SUN        (+/- )                  (+/- )             (+/- )
SEO        (+/- )                  (+/- )             (+/- )
Baseline   (+/- )                  (+/- )             (+/- )

Fig. 5. Correlation, NSS and ROC for the baseline and the four dynamic saliency methods, for each of the five video categories. The bar graphs at the bottom right show the mean and standard error (SE) for the whole dataset.

References

1. Itti, L., Koch, C.: Computational modelling of visual attention. Nature Reviews Neuroscience 2 (2001)
2. Olveczky, B.P., Baccus, S.A., Meister, M.: Segregation of object and background motion in the retina. Nature 423 (2003)
3. James, W.: The Principles of Psychology, Vol. 1. Dover Publications (1950)
4. Styles, E.A.: Attention, Perception, and Memory: An Integrated Introduction. Taylor & Francis Routledge, New York, NY (2005)
5. Wolfe, J.M.: Visual attention. In: Seeing, Academic Press (2000)
6. Marques, O., Mayron, L.M., Borba, G.B., Gamba, H.R.: An attention-driven model for grouping similar images with image retrieval applications. EURASIP Journal on Advances in Signal Processing 2007 (2007)
7. Siagian, C., Itti, L.: Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007)
8. Siagian, C., Itti, L.: Biologically inspired mobile robot vision localization. IEEE Transactions on Robotics 25 (2009)
9. Ma, Y.F., Hua, X.S., Lu, L., Zhang, H.J.: A generic framework of user attention model and its application in video summarization. IEEE Transactions on Multimedia 7 (2005)

10. Culibrk, D., Mirkovic, M., Zlokolica, V., Pokric, M., Crnojevic, V., Kukolj, D.: Salient motion features for video quality assessment. IEEE Transactions on Image Processing (2010)
11. Itti, L.: Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Transactions on Image Processing 13 (2004)
12. Bruce, N., Tsotsos, J.: Saliency based on information maximization. Advances in Neural Information Processing Systems 18 (2006)
13. Seo, H., Milanfar, P.: Static and space-time visual saliency detection by self-resemblance. Journal of Vision 9 (2009)
14. Zhang, L., Tong, M., Marks, T., Shan, H., Cottrell, G.: SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision 8 (2008)
15. Mancas, M., Riche, N., Leroy, J., Gosselin, B.: Abnormal motion selection in crowds using bottom-up saliency. In: IEEE ICIP (2011)
16. Mahadevan, V., Vasconcelos, N.: Spatiotemporal saliency in dynamic scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010)
17. Mancas, M., Riche, N.: Attention website. ( )
18. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998)
19. Li, L., Huang, W., Gu, I., Tian, Q.: Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing 13 (2004)
20. Rosin, P.L.: Thresholding for change detection. In: Proc. of the Sixth International Conference on Computer Vision (ICCV 98) (1998)
21. Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artificial Intelligence Review 22 (2004)
22. Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: SCIA (2003)
23. Itti, L.: CRCNS-ORIG video and eye-tracking database. ( )
24. Henderson, J.M.: DIEM video and eye-tracking database. ( )
25. Hadizadeh, H., Enriquez, M.J., Bajic, I.V.: Eye-tracking database for a set of standard video sequences. Accepted for publication in IEEE Transactions on Image Processing (2011)
26. Dorr, M., Martinetz, T., Gegenfurtner, K.R., Barth, E.: Variability of eye movements when viewing dynamic natural scenes. Journal of Vision 10 (2010)
27. Mahadevan, V., Vasconcelos, N.: Spatiotemporal saliency in dynamic scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010)
28. Seeing Machines: FaceLab commercial eye-tracking system. ( )
29. Peters, R.J., Iyer, A., Itti, L., Koch, C.: Components of bottom-up gaze allocation in natural images. Vision Research 45 (2005)
30. Green, D.M., Swets, J.A.: Signal Detection Theory and Psychophysics. Wiley, New York (1966)
31. Borji, A.: Evaluation measures for saliency maps: CC and NSS. ( )
32. Lau, B.: Evaluation measures for saliency maps: AUC-ROC. ( under roc curve)
33. Gurland, J., Tripathi, R.C.: A simple approximation for unbiased estimation of the standard deviation. The American Statistician 25 (1971) 30-32


More information

Video Indexing. Option TM Lecture 1 Jenny Benois-Pineau LABRI UMR 5800/Université Bordeaux

Video Indexing. Option TM Lecture 1 Jenny Benois-Pineau LABRI UMR 5800/Université Bordeaux Video Indexing. Option TM Lecture 1 Jenny Benois-Pineau LABRI UMR 5800/Université Bordeaux Summary Human vision. Psycho-visual human system. Visual attention. «Subjective» visual saliency. Prediction of

More information

A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b and Guichi Liu2, c

A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b and Guichi Liu2, c 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2015) A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b

More information

Video Compression Using Block By Block Basis Salience Detection

Video Compression Using Block By Block Basis Salience Detection Video Compression Using Block By Block Basis Salience Detection 1 P.Muthuselvi M.Sc.,M.E.,, 2 T.Vengatesh MCA.,M.Phil.,(Ph.D)., 1 Assistant Professor,VPMM Arts & Science college for women, Krishnankoil

More information

Class 9 Action Recognition

Class 9 Action Recognition Class 9 Action Recognition Liangliang Cao, April 4, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Visual Recognition

More information

Detecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution

Detecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution Detecting Salient Contours Using Orientation Energy Distribution The Problem: How Does the Visual System Detect Salient Contours? CPSC 636 Slide12, Spring 212 Yoonsuck Choe Co-work with S. Sarma and H.-C.

More information

Target Tracking Using Mean-Shift And Affine Structure

Target Tracking Using Mean-Shift And Affine Structure Target Tracking Using Mean-Shift And Affine Structure Chuan Zhao, Andrew Knight and Ian Reid Department of Engineering Science, University of Oxford, Oxford, UK {zhao, ian}@robots.ox.ac.uk Abstract Inthispaper,wepresentanewapproachfortracking

More information

Perception. Autonomous Mobile Robots. Sensors Vision Uncertainties, Line extraction from laser scans. Autonomous Systems Lab. Zürich.

Perception. Autonomous Mobile Robots. Sensors Vision Uncertainties, Line extraction from laser scans. Autonomous Systems Lab. Zürich. Autonomous Mobile Robots Localization "Position" Global Map Cognition Environment Model Local Map Path Perception Real World Environment Motion Control Perception Sensors Vision Uncertainties, Line extraction

More information

Tri-modal Human Body Segmentation

Tri-modal Human Body Segmentation Tri-modal Human Body Segmentation Master of Science Thesis Cristina Palmero Cantariño Advisor: Sergio Escalera Guerrero February 6, 2014 Outline 1 Introduction 2 Tri-modal dataset 3 Proposed baseline 4

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

Nonparametric Bottom-Up Saliency Detection by Self-Resemblance

Nonparametric Bottom-Up Saliency Detection by Self-Resemblance Nonparametric Bottom-Up Saliency Detection by Self-Resemblance Hae Jong Seo and Peyman Milanfar Electrical Engineering Department University of California, Santa Cruz 56 High Street, Santa Cruz, CA, 95064

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

Salient Region Detection using Weighted Feature Maps based on the Human Visual Attention Model

Salient Region Detection using Weighted Feature Maps based on the Human Visual Attention Model Salient Region Detection using Weighted Feature Maps based on the Human Visual Attention Model Yiqun Hu 2, Xing Xie 1, Wei-Ying Ma 1, Liang-Tien Chia 2 and Deepu Rajan 2 1 Microsoft Research Asia 5/F Sigma

More information

BLIND QUALITY ASSESSMENT OF VIDEOS USING A MODEL OF NATURAL SCENE STATISTICS AND MOTION COHERENCY

BLIND QUALITY ASSESSMENT OF VIDEOS USING A MODEL OF NATURAL SCENE STATISTICS AND MOTION COHERENCY BLIND QUALITY ASSESSMENT OF VIDEOS USING A MODEL OF NATURAL SCENE STATISTICS AND MOTION COHERENCY Michele A. Saad The University of Texas at Austin Department of Electrical and Computer Engineering Alan

More information

AUTONOMOUS IMAGE EXTRACTION AND SEGMENTATION OF IMAGE USING UAV S

AUTONOMOUS IMAGE EXTRACTION AND SEGMENTATION OF IMAGE USING UAV S AUTONOMOUS IMAGE EXTRACTION AND SEGMENTATION OF IMAGE USING UAV S Radha Krishna Rambola, Associate Professor, NMIMS University, India Akash Agrawal, Student at NMIMS University, India ABSTRACT Due to the

More information

Morphological Change Detection Algorithms for Surveillance Applications

Morphological Change Detection Algorithms for Surveillance Applications Morphological Change Detection Algorithms for Surveillance Applications Elena Stringa Joint Research Centre Institute for Systems, Informatics and Safety TP 270, Ispra (VA), Italy elena.stringa@jrc.it

More information

EE 5359 MULTIMEDIA PROCESSING. Implementation of Moving object detection in. H.264 Compressed Domain

EE 5359 MULTIMEDIA PROCESSING. Implementation of Moving object detection in. H.264 Compressed Domain EE 5359 MULTIMEDIA PROCESSING Implementation of Moving object detection in H.264 Compressed Domain Under the guidance of Dr. K. R. Rao Submitted by: Vigneshwaran Sivaravindiran UTA ID: 1000723956 1 P a

More information

Visual Tracking. Image Processing Laboratory Dipartimento di Matematica e Informatica Università degli studi di Catania.

Visual Tracking. Image Processing Laboratory Dipartimento di Matematica e Informatica Università degli studi di Catania. Image Processing Laboratory Dipartimento di Matematica e Informatica Università degli studi di Catania 1 What is visual tracking? estimation of the target location over time 2 applications Six main areas:

More information

Motion and Optical Flow. Slides from Ce Liu, Steve Seitz, Larry Zitnick, Ali Farhadi

Motion and Optical Flow. Slides from Ce Liu, Steve Seitz, Larry Zitnick, Ali Farhadi Motion and Optical Flow Slides from Ce Liu, Steve Seitz, Larry Zitnick, Ali Farhadi We live in a moving world Perceiving, understanding and predicting motion is an important part of our daily lives Motion

More information

A semiautomatic saliency model and its application to video compression

A semiautomatic saliency model and its application to video compression A semiautomatic saliency model and its application to video compression Vitaliy Lyudvichenko, Mikhail Erofeev, Yury Gitman, Dmitriy Vatolin Abstract This work aims to apply visual-attention modeling to

More information

A Hybrid Face Detection System using combination of Appearance-based and Feature-based methods

A Hybrid Face Detection System using combination of Appearance-based and Feature-based methods IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.5, May 2009 181 A Hybrid Face Detection System using combination of Appearance-based and Feature-based methods Zahra Sadri

More information

Automatic Shadow Removal by Illuminance in HSV Color Space

Automatic Shadow Removal by Illuminance in HSV Color Space Computer Science and Information Technology 3(3): 70-75, 2015 DOI: 10.13189/csit.2015.030303 http://www.hrpub.org Automatic Shadow Removal by Illuminance in HSV Color Space Wenbo Huang 1, KyoungYeon Kim

More information