Saliency Detection for Videos Using 3D FFT Local Spectra


Zhiling Long and Ghassan AlRegib
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
Further author information: Z. Long, E-mail: zhiling.long@gatech.edu; G. AlRegib, E-mail: alregib@gatech.edu

ABSTRACT
Bottom-up spatio-temporal saliency detection identifies perceptually important regions of interest in video sequences. The center-surround model has proven useful for visual saliency detection. In this work, we explore 3D FFT local spectra as features for saliency detection within the center-surround framework. We develop a spectral-location-based decomposition scheme that divides a 3D FFT cube into two components, one related to temporal changes and the other to spatial changes. Temporal saliency and spatial saliency are detected separately, using features derived from each spectral component through a simple center-surround comparison. The two detection results are then combined to yield a saliency map. We apply the same detection algorithm to different color channels (YIQ) and incorporate the results into the final saliency determination. The proposed technique is tested on the public CRCNS database. Both visual and numerical evaluations verify the promising performance of our technique.

Keywords: saliency, video, 3D FFT, local spectra

1. INTRODUCTION
Bottom-up spatio-temporal saliency detection identifies perceptually important regions of interest in video sequences. Such detection may help video processing tasks (e.g., coding, classification, summarization) become more effective and efficient. The center-surround model has proven useful for visual saliency detection [1]. According to this model, for an object to be viewed as salient in a visual scene, it has to stand out from its surrounding environment. To translate this into a computational model, features characterizing the object and its surroundings are extracted, and the differences between the features of the object (i.e., the center) and the environment (i.e., the surround) are compared to evaluate the saliency of the object. In other words, the contrast of features is used as a measure of visual saliency.

Most existing saliency detection techniques extract features in the time-space domain. In this research, we are interested in exploring frequency-domain local features, in an attempt to capture the intrinsic patterns of localized changes in the signal. For videos, the classical 3D fast Fourier transform (FFT) is the most basic tool for examining signals in the frequency domain. However, although the 2D FFT has been applied successfully to saliency detection for images [2, 3], we notice that the 3D FFT has rarely been used directly to detect saliency in videos. We believe there is a reason for this. The human visual system (HVS) may respond to temporal changes and spatial changes in a scene in different ways. Depending on the situation, it may also pay more attention to one type of change than to the other. As a typical example, people tend to be attracted more to a moving object than to the static background it moves against. It is therefore reasonable to evaluate temporal saliency and spatial saliency separately using different methods, and then combine them, possibly in a weighted manner. The 3D FFT captures changes in the space-time domain with a 3D spectrum. If used directly, this spectrum does not differentiate temporal changes from spatial changes.
As we verified in our study, employing the whole 3D spectrum as features does not yield satisfactory results. However, if we can separate the two types of changes mixed in a 3D FFT spectrum, we may still accomplish effective saliency detection for 3D signals such as videos.

If we succeed, there will be at least two obvious advantages to such an algorithm: 1) applying the 3D FFT to videos does not involve complicated parameter settings, a common shortcoming of many available techniques; and 2) FFT computation can be very efficient and fast, with hardware implementation support available. Therefore, in this paper, we develop a scheme that decomposes a 3D FFT spectrum into two components, one related to temporal changes and the other to spatial changes. Based on this, we develop a saliency detection algorithm that uses a very simple center-surround comparison of features derived from the decomposed spectral components. Temporal and spatial saliency are combined, from all available color channels, to yield the final saliency maps. We note that our features are extracted by applying the 3D FFT to localized time-space cubes obtained via sliding windows, whereas in several previous techniques the FFT was typically used in a global manner in 2D form [2, 3].

2. METHODS

2.1 3D FFT Local Spectral Features

Figure 1. Illustration of (a) spectral locations and (b) spectral decomposition.

As outlined above, we decompose a 3D FFT spectrum into two parts: a component related to temporal changes and a component related to spatial changes. The decomposition is based on spectral locations, and our reasoning is illustrated in Fig. 1. Within a 3D spectral cube in an $f_x$-$f_y$-$f_t$ coordinate system, a spectral point closer to the $f_t$-axis is more related to temporal changes, whereas a spectral point closer to the $f_x$-$f_y$ plane is associated more with spatial changes. Therefore, we decompose a spectral point into the two associated components by applying two different weights calculated according to its location within the cube. As illustrated in Fig. 1, for a given spectral point $F(a_0, b_0, c_0)$, we calculate its temporal component by projecting the point onto the $f_t$-axis:

$$F_t(a_0, b_0, c_0) = F(a_0, b_0, c_0)\,\sin\theta = F(a_0, b_0, c_0)\,\frac{MN}{OM} = F(a_0, b_0, c_0)\,\frac{c_0}{\sqrt{a_0^2 + b_0^2 + c_0^2}}. \quad (1)$$

Similarly, we obtain its spatial component by projecting onto the $f_x$-$f_y$ plane:

$$F_s(a_0, b_0, c_0) = F(a_0, b_0, c_0)\,\cos\theta = F(a_0, b_0, c_0)\,\frac{ON}{OM} = F(a_0, b_0, c_0)\,\frac{\sqrt{a_0^2 + b_0^2}}{\sqrt{a_0^2 + b_0^2 + c_0^2}}. \quad (2)$$

Thus, the 3D FFT cube is decomposed into two cubes of exactly the same size, $F_t$ and $F_s$. We then calculate the absolute mean over each cube to obtain the corresponding features, $E_t$ and $E_s$, from which temporal saliency and spatial saliency are determined through the center-surround comparison. Fig. 2 presents examples demonstrating the effectiveness of our features: the $E_t$ values capture the motion in the video, while the $E_s$ values reflect the spatial complexity of the signal.

Figure 2. Examples demonstrating the effectiveness of our 3D FFT local spectra based features: (a) original video frame with eye fixation ground truth (small colored squares on the two moving cars); (b) Et values overlaid on the frame (red and yellow denote higher values); and (c) Es values overlaid on the frame.
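The sketch below illustrates one possible implementation of this location-based decomposition (Eqs. 1 and 2) for a single local cube, assuming NumPy. The function name, array layout, and the use of the magnitude of the frequency coordinates as projection lengths are our own illustrative choices, not taken from the paper; the spectral center (DC term) is excluded from the energy computation, as discussed in the text.

```python
import numpy as np

def fft_energy_features(cube):
    """Decompose the 3D FFT spectrum of a local space-time cube (Eqs. 1-2)
    and return the scalar energies (E_t, E_s).

    `cube` is a small video patch of shape (height, width, frames),
    e.g. 5 x 5 x 5. Names and layout are illustrative assumptions.
    """
    F = np.fft.fftshift(np.fft.fftn(cube))      # centered 3D spectrum
    h, w, l = cube.shape

    # Frequency coordinates relative to the spectral center.
    fy, fx, ft = np.meshgrid(
        np.arange(h) - h // 2,
        np.arange(w) - w // 2,
        np.arange(l) - l // 2,
        indexing="ij",
    )
    radius = np.sqrt(fx**2 + fy**2 + ft**2)
    center = radius == 0                        # DC term (spectral center)
    radius[center] = 1.0                        # avoid division by zero

    sin_theta = np.abs(ft) / radius             # weight toward the f_t axis
    cos_theta = np.sqrt(fx**2 + fy**2) / radius # weight toward the f_x-f_y plane

    Ft = np.where(center, 0.0, F * sin_theta)   # temporal component cube
    Fs = np.where(center, 0.0, F * cos_theta)   # spatial component cube

    n = F.size - 1                              # DC term excluded
    E_t = np.abs(Ft).sum() / n                  # absolute mean over the cube
    E_s = np.abs(Fs).sum() / n
    return E_t, E_s
```

In this sketch a single scalar pair (E_t, E_s) is produced per cube; sliding such cubes over the video yields the feature sets {E_t} and {E_s} used in the next subsection.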

The equations above for spectral decomposition do not apply to one special case, $a_0 = b_0 = c_0 = 0$, which is the center of the spectral cube. The spectral center corresponds to the DC component of the spectrum, which does not reflect any changes in either the space domain or the time domain. Therefore, we exclude the spectral center when calculating either of the energy features.

2.2 Center-surround Based Saliency Detection
To perform center-surround based saliency detection, we follow an existing center-surround framework, saliency detection by self-resemblance (SDSR) [4], but fit our own feature extraction and comparison schemes into it. We first apply the 3D FFT to localized time-space cubes obtained using sliding windows. Each 3D FFT cube is then decomposed into its two components as described above, and the energy features {E_t} and {E_s} are extracted from the decomposed components.

Comparison of features is a key step in the center-surround model. Because our features are simple scalars, we can use a simple yet effective comparison scheme, as follows. Assume a $W \times H \times L$ video signal, with $W$ the frame width, $H$ the frame height, and $L$ the total number of frames. For the temporal components computed over all video frames, we have a set of features $\{E_t(i, j, k)\}$, where $(i, j, k)$ are the center coordinates of each localized data cube ($i = 1, 2, \ldots, H$; $j = 1, 2, \ldots, W$; $k = 1, 2, \ldots, L$). We first normalize the features within each frame, obtaining a normalized set $\{E_t^n(i, j, k)\}$:

$$E_t^n(i, j, k) = \frac{E_t(i, j, k) - E_t^{\min}(k)}{E_t^{\max}(k) - E_t^{\min}(k)}, \quad (3)$$

where $E_t^{\min}(k)$ and $E_t^{\max}(k)$ are the minimum and maximum values within frame $k$, respectively. After the normalization, the saliency value at each location is determined by comparing the center against its surrounding neighbors:

$$S_t(i, j, k) = \frac{1}{N} \sum_{i_0, j_0, k_0} \left[ E_t^n(i, j, k) - E_t^n(i + i_0, j + j_0, k + k_0) \right]. \quad (4)$$

Here, $i_0$, $j_0$, $k_0$ are chosen such that the point $(i + i_0, j + j_0, k + k_0)$ lies in the immediate neighborhood of $(i, j, k)$, e.g., within a $3 \times 3 \times 3$ window centered at $(i, j, k)$, and $N$ is the total number of points included in the summation. The spatial saliency $S_s(i, j, k)$ is calculated in the same way.

Once the temporal saliency and the spatial saliency are determined, the two maps can be combined to yield a final detection map. Different combination schemes are possible; in this paper we do not explore this aspect, as our focus is on the feature extraction, and we simply average the two maps. As will be seen in the experiments, this already yields promising results.
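A minimal sketch of the per-frame normalization (Eq. 3) and the neighborhood comparison (Eq. 4), again assuming NumPy. The function name and the edge-padding at frame borders are our own illustrative choices; the paper does not specify border handling.

```python
import numpy as np

def center_surround_saliency(E, radius=1):
    """Compute a saliency map from scalar features E of shape (H, W, L)
    via per-frame min-max normalization (Eq. 3) and a local
    center-surround comparison over a (2*radius+1)^3 window (Eq. 4)."""
    H, W, L = E.shape

    # Eq. (3): normalize within each frame k.
    E_min = E.min(axis=(0, 1), keepdims=True)
    E_max = E.max(axis=(0, 1), keepdims=True)
    En = (E - E_min) / np.maximum(E_max - E_min, 1e-12)

    # Eq. (4): compare each point with its immediate neighbors.
    pad = np.pad(En, radius, mode="edge")
    S = np.zeros_like(En)
    N = 0
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            for dk in range(-radius, radius + 1):
                if di == dj == dk == 0:
                    continue
                shifted = pad[radius + di: radius + di + H,
                              radius + dj: radius + dj + W,
                              radius + dk: radius + dk + L]
                S += En - shifted
                N += 1
    return S / N

# As described in the text, the two maps are then simply averaged, e.g.:
# saliency = 0.5 * (center_surround_saliency(E_t) + center_surround_saliency(E_s))
```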

2.3 Colors
As commonly accepted, colors play a very important role in saliency detection. Typically, color-related features are extracted in the space domain [5]. In this paper, we apply the same saliency detection algorithm to each color channel, assuming that temporal and spatial changes can be discerned in all channels. We employ the YIQ color model for this initial study. As a result, we obtain three saliency maps, one per color channel, and simply average them to obtain the final saliency map, as sketched below. As shown in Fig. 3, the different color channels may complement each other, making the final detection more reliable.

Figure 3. Examples of applying our saliency detection algorithm to different color channels. (top) Original video frame with eye fixation on the red helmet; (bottom) saliency detection results for the different color channels. First row: temporal, Y; temporal, I; temporal, Q. Second row: spatial, Y; spatial, I; spatial, Q.
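The following sketch shows the per-channel processing and averaging described in Sec. 2.3. The RGB-to-YIQ conversion matrix is the standard NTSC one; `saliency_map_for_channel` is a hypothetical helper standing in for the full single-channel pipeline (local 3D FFT features plus center-surround comparison and temporal/spatial averaging) and is not part of the paper.

```python
import numpy as np

# Standard NTSC RGB -> YIQ conversion matrix.
RGB_TO_YIQ = np.array([[0.299,  0.587,  0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523,  0.312]])

def saliency_from_color_video(rgb_video, saliency_map_for_channel):
    """rgb_video: array of shape (H, W, L, 3) with values in [0, 1].
    saliency_map_for_channel: hypothetical callable implementing the
    single-channel detector, returning an (H, W, L) saliency map."""
    # Convert every frame to YIQ.
    yiq = rgb_video @ RGB_TO_YIQ.T                  # shape (H, W, L, 3)

    # Run the same detector on each channel and average the three maps.
    maps = [saliency_map_for_channel(yiq[..., c]) for c in range(3)]
    return sum(maps) / 3.0
```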

2.4 Evaluation
To evaluate the saliency detection performance, both visual inspection and objective assessment are performed. For the latter, two metrics are employed: the area under the receiver operating characteristic (ROC) curve (AUC) and the Kullback-Leibler divergence (KLD) [6]. Both metrics are based on detection theory. First, two data sets are created from the saliency detection results. One set contains the true detections (TD), i.e., saliency values at the human eye fixation positions from the ground truth. The other set contains the false positives (FP), i.e., saliency values at random positions other than the fixations. When a threshold is set on the saliency values, the true detection rate and the false positive rate can be calculated from the respective sets. By sweeping through the whole range of possible thresholds, pairs of true detection rate and false positive rate are obtained to yield an ROC curve; the area under this curve is the AUC metric. To find the KLD, the two sets (TD and FP, without thresholding) are treated as two random distributions: histograms are generated for each set and compared to yield the KLD. For both metrics, a greater value indicates better performance. The AUC ranges from 0 to 1, while the KLD has no upper bound.

The most common method for generating the FP set is to randomly sample the saliency map. As revealed in recent research [7], such sampling introduces a so-called center bias into the evaluation results. To avoid this bias, a modified sampling method was introduced [7] and has been widely adopted. Under these shuffled metrics, the AUC and KLD are computed with the FP set formed from the locations of human eye fixations in a different (shuffled) frame of the same video. We use this shuffled scheme to generate the AUC and KLD in our experiments.

3. EXPERIMENTS
We test our algorithm on the public CRCNS database [8]. The database includes 50 videos with durations ranging from 5 to 90 seconds. The video content is diverse, covering street scenes, TV sports, TV news, TV talks, video games, etc. Ground truth is provided with each video in the form of eye fixations recorded from a group of human subjects watching the videos under free-viewing conditions. For our experiments, we calculated the 3D FFT on 5 x 5 x 5 cubes, with non-overlapping windows in the space domain. Before the FFT, we also downsampled the original video frames to 160 x 120 pixels. For the center-surround feature comparison, a window size of 3 x 3 x 3 was adopted.

We compare the performance of our algorithm against the Bayesian Surprise [9] and SUNDAy [7] models, as shown in Table 1. We also include Chance as a baseline, corresponding to saliency detection by random guessing. We note that the results for Bayesian Surprise and SUNDAy are the numbers reported in the SUNDAy work [7]. Our numerical results are better. In Fig. 4, we provide some example saliency maps for visual examination. Although the overall visual appearances are comparable, our method tends to identify the true target of attention better than Bayesian Surprise.

Table 1. Objective evaluation of saliency detection performance using the shuffled metrics.

  Metric   Chance   Bayesian Surprise   SUNDAy   Proposed
  AUC      0.5      0.581               0.582    0.595
  KLD      0        0.034               0.041    0.184

Figure 4. Example results for saliency detection: (a) original video frame with eye fixations (denoted by small colored squares); (b) saliency by Bayesian Surprise; (c) saliency by our method.

4. CONCLUSIONS
In this paper, we developed a saliency detection algorithm for videos using features based on 3D FFT local spectra. From our work, we draw the following conclusions: 1) Separating the spectral components related to temporal and spatial changes is important, and can be accomplished with the location-based spectral decomposition method we developed. 2) The spectral decomposition based approach can be applied to different color channels, making it unnecessary to process each channel differently. 3) The algorithm outperforms the popular techniques in our experiments; at the same time, owing to its simple feature extraction and comparison schemes, it is especially promising in terms of efficiency and scalability. In the future, we plan to study various approaches for combining the temporal and spatial saliency maps. We may also explore how to incorporate local phase information to improve detection performance.

REFERENCES
[1] Gao, D., Mahadevan, V., and Vasconcelos, N., "On the plausibility of the discriminant center-surround hypothesis for visual saliency," Journal of Vision 8(7), 1-18 (2008).
[2] Hou, X. and Zhang, L., "Saliency detection: A spectral residual approach," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2007).
[3] Guo, C. and Zhang, L., "A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression," IEEE Transactions on Image Processing 19(1), 185-198 (2010).

[4] Seo, H. and Milanfar, P., "Static and space-time visual saliency detection by self-resemblance," Journal of Vision 9(12), 1-27 (2009).
[5] Itti, L., "Automatic foveation for video compression using a neurobiological model of visual attention," IEEE Transactions on Image Processing 13(10), 1304-1318 (2004).
[6] Borji, A., Sihite, D. N., and Itti, L., "Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study," IEEE Transactions on Image Processing 22(1), 55-69 (2013).
[7] Zhang, L., Tong, M. H., and Cottrell, G. W., "SUNDAy: Saliency using natural statistics for dynamic analysis of scenes," Proceedings of the 31st Annual Cognitive Science Conference (2009).
[8] Itti, L., "Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes," Visual Cognition 12(6), 1093-1123 (2005).
[9] Itti, L. and Baldi, P. F., "Bayesian surprise attracts human attention," Vision Research 49(10), 1295-1306 (2009).