Adaptive Fusion of Human Visual Sensitive Features for Surveillance Video Summarization

MD. MUSFEQUS SALEHIN 1,* AND MANORANJAN PAUL 1
1 School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW 2795, Australia
* Corresponding author: msalehin@csu.edu.au
Compiled March 31, 2017

Surveillance video cameras capture a large amount of continuous video stream every day. To analyze or investigate any significant event, identifying it within this huge volume of video data is a laborious and tedious job if done manually. Existing approaches sometimes neglect key frames with significant visual content and/or select unimportant frames with low or no activity. To solve this problem, a video summarization technique is proposed in this paper by combining three multi-modal human visual sensitive features: foreground objects, motion information, and visual saliency. In a video stream, foreground objects are among the most important contents of a video, as they contain more detailed information and play a major role in important events. Moreover, motion is another stimulus of a video that attracts human visual attention significantly. Motion information is therefore calculated in the spatial as well as the frequency domain. Spatial motion information can locate object motion accurately; however, it is sensitive to illumination changes. Frequency-domain motion information, on the other hand, is robust to illumination changes, although it is easily affected by noise. Therefore, motion information in both the spatial and frequency domains is employed. Furthermore, the visual attention cue is a sensitive feature indicating the level of a user's attention and is useful for determining key frames. As these features individually cannot perform very well, they are combined to obtain better results. For this purpose, an adaptive linear weighted fusion scheme is proposed to combine the features and rank video frames for summarization. Experimental results reveal that the proposed method outperforms the state-of-the-art methods. © 2017 Optical Society of America

1. INTRODUCTION

Every day an enormous amount of surveillance video is captured throughout the whole world for providing security, monitoring, preventing crime, controlling traffic, and so on. In general, a number of surveillance video cameras are set up in different places of a building, business area, or congested area, and these cameras are connected to a monitoring server for storage and investigation. Storing this huge volume of video data requires tremendous memory space. In addition, to find any important event in the stored video for investigation or analysis, operators need to access the stored videos. This process is very tedious, time consuming, and not cost effective. To solve these problems, a method for generating a shorter version of the original video containing the important events is highly desirable for memory management and information retrieval. Video summarization is the process of selecting the most informative frames so that the summary contains all the necessary events or information of a long video while rejecting unnecessary frames to keep the summarized video as concise as possible. Therefore, a good video summarization method has several important properties.
First, it must have the capability to produce a video with all significant incidents of the original video. Second, it should be able to generate a much smaller version of the provided long video. Third, it should not contain repetitive information. The main purpose of video summarization is to represent a long original video in a condensed version in such a way that a user can get an overall idea of the events that occurred in the entire video within a constrained amount of time. Existing approaches sometimes neglect key frames with significant visual content and/or select unimportant frames with low or no activity. In a video stream, foreground objects are among the most important contents of a video, as they contain more detailed information and play a major role in important events [1]. A frame with a large foreground object area is more informative than one with little or no foreground.

Moreover, motion is another stimulus of a video that attracts human visual attention significantly [2]. Human activity, which is also an important content of a video, can be easily represented by motion information. Furthermore, the visual attention cue is a sensitive feature indicating the level of a user's attention for determining key frames [3]. This attention information is used to model the human perception system for understanding the content of the video [3]. Motivated by the above-mentioned findings, a video summarization scheme is proposed in this paper based on foreground objects, their motion, and the visual attention cue in a video. A frame with a larger absolute foreground area has a higher probability of presenting an important event. To obtain the absolute size of the foreground object areas, Gaussian mixture-based parametric dynamic background modeling (DBM) [4] is applied in the proposed approach. However, not all frames with larger foreground areas within an event should be selected as key frames if they are similar and have little or no motion among them compared to the background scene. To acquire complete information about object motion in a video, object motion is extracted not only in the spatial domain but also in the frequency domain. Although spatial motion information is able to indicate object motion related to important events, it is sensitive to illumination changes [5][6] and it provides overestimated motion areas comprising object areas, uncovered background areas, and occluded areas compared to the background scene. On the other hand, motion information in the frequency domain is invariant to illumination [5][6]; however, it is easily affected by noise [5][6]. To obtain motion information in the spatial domain, the consecutive frame difference (CFD) is applied. To obtain object motion in the frequency domain, we use the phase correlation (PC) technique due to its remarkable accuracy and its robustness to uniform variations of illumination and signal noise in images [7]. Besides the size of the foreground and the motion information, visual saliency is also important for key frame extraction [3]. For this, we employ the visual attention cue using the graph-based visual saliency (GBVS) method [8]. There are several advantages of using the GBVS approach: it predicts human fixations within a frame accurately, it yields higher saliency values at the centre of the image plane, and it draws attention to salient regions robustly. The GBVS method prepares a saliency map based on the spatial contrast within a frame, whether or not the frame contains any foreground object. It is revealed in the proposed method that when any object movement occurs in a frame, the saliency value at almost every point of the saliency map changes. This observation motivates us to consider the difference of saliency maps between two consecutive frames as a feature for video summarization, because the saliency difference map provides more distinguishable information than a single saliency map. In the proposed method, a novel adaptive fusion scheme is also introduced for combining the features. The weights of the fusion are learnt during a training session; therefore, the scheme provides an opportunity to adjust the result of video summarization. Finally, the summary of the video is generated as per the skimming ratio provided by the user; otherwise, the proposed approach generates the summary of the original video based on the default skimming ratio.
Therefore, the contributions of the paper are as follows:
1. We introduce a novel feature, namely the peak of the phase shift obtained from the phase correlation technique, and apply it to video summarization in order to extract motion information in the frequency domain that is robust to illumination changes;
2. We introduce another novel feature, namely the saliency difference, to exploit temporal salient information, as a single saliency map does not indicate salient changes;
3. We develop an adaptive fusion scheme to combine the different features, as the contents are not always the same within a video or among all videos.

The structure of the remaining paper is as follows. Section 2 reviews related research. The proposed method is described in Section 3. Experimental results as well as detailed discussions are provided in Section 4. Finally, concluding remarks are drawn in Section 5.

2. RELATED RESEARCH

In the literature, different approaches have been proposed for summarizing various types of videos. These videos can be categorized into egocentric video, user-generated video, movie, endoscopic video, and surveillance video. Egocentric video is generally captured by wearable cameras for socio-behavioral analysis of the camera wearer's daily life. For summarizing this type of video, region saliency is predicted in [9] using a regression model and storyboards are generated based on region importance scores. In [10], story-driven egocentric video is summarized by discovering the most influential objects within a video. Gaze tracking information is applied in [11] for egocentric video summarization using sub-modular function maximization. User-generated video is usually captured by handheld cameras or smart phones by non-professionals. For summarizing user-generated video, an adaptive sub-modular maximization function is applied in [12]. A collaborative sparse coding model is utilized in [13] for generating summaries of the same type of videos. Web images are used in [14] to enhance the process of summarizing user-generated video. Category-specific (e.g., birthday party) user video is summarized in [15] by automatically segmenting the video temporally, scoring each segment with a support vector machine (SVM), and selecting the highest-scoring segments. In [16], the Deep Event Network (DevNet) is introduced for high-level event detection and spatio-temporal localization of important evidence in user-generated video. For detecting important events in user-generated video, low-level and semantic-level visual and audio features are applied in [17]. Movies or films captured by professional cinematographers contain high-definition video and audio for entertainment. To summarize movies, aural, visual, and textual features are merged in [18]. As role communities contain information about previous and later scenes, a network of role communities is applied in [19] for movie summarization. Film comics are generated using eye-tracking data in [20]. To summarize endoscopic video, the ORB (Oriented FAST and Rotated BRIEF) key-point descriptor is applied in [21]. An unsupervised learning method using visual and temporal descriptors is proposed in [22] to partition frames into homogeneous categories; the most typical frames are then selected to summarize the endoscopic video. In [23], image moments, curvature, and multi-scale contrast are combined to generate a saliency map for each frame, which is used to select key frames for endoscopic video summarization. A hidden Markov model-based framework is introduced in [24] for endoscopic video summarization.
However, the importance of surveillance video summarization for providing security, monitoring a restricted area, preventing crime, and controlling traffic [1] is higher than that of other types of video summarization (e.g., egocentric, user-generated, movie, etc.), because the main purpose of user-generated or egocentric video summarization is to summarize daily social activities [9][10][11][12][13][14].

Therefore, we are motivated to propose a framework to summarize surveillance video. To summarize surveillance video, an object-centered technique is applied in [25]. A dynamic videobook is proposed in [26] for representing surveillance video in a hierarchical order. A learned distance metric is introduced in [27] for summarizing nursery school surveillance video. In [28], motion saliency is calculated based on a dynamic visual saliency model using integral-image-based temporal gradients; informative key frames are then extracted based on motion contrast and the salient object's coverage ratio. Maximum a posteriori probability (MAP) is used in [29] for summary generation. The dynamic visual saliency is calculated by temporal gradients and the static saliency is measured by the discrete cosine transform (DCT) in [30], and a non-linear weighted fusion method is applied to combine the static and dynamic saliency. In [31], the correlation of RGB color channels, color histograms, and moments of inertia are combined to extract key frames. In [32], each video frame is divided into 8×8 blocks and the DCT is applied to them; the DC term of each block is then extracted to construct a DC image that is 64 times smaller than the original frame. The DC image is converted into the HSV color space and a 256-dimensional color histogram is computed. After that, zero-mean normalized cross correlation is applied to the color histograms to select representative frames. Finally, color distribution and gradient orientation are applied for redundant frame removal. Recently, a method was proposed in [1] for surveillance video summarization. A single-view summary is generated in this approach for each sensor independently. For this purpose, the MPEG-7 color layout descriptor is applied to each video frame and an online Gaussian mixture model (GMM) is used for clustering. The key frames are selected based on the parameters of the clusters. As the decision of selecting or neglecting a frame is based on the continuous updates of these clustering parameters, a video segment is extracted instead of individual key frames. A video summarization technique using a single type of descriptor (i.e., a color descriptor) at the frame level with an on-line learning (i.e., GMM) strategy provides very good performance if the video has a uni-modal phenomenon; however, the technique may not perform well if the video has multi-modal phenomena such as illumination change, variation of local motion, and occlusion. To overcome this, we need to use multi-modal features for selecting key frames, as a uni-modal feature sometimes fails to capture a specific phenomenon. For example, foreground objects sometimes do not carry explicit motion information. Again, spatial motion among adjacent frames does not provide appropriate motion information for key frame selection in the case of illumination changes [5][6], and sometimes it provides overestimated motion areas comprising object areas, uncovered background areas, and occluded areas. Moreover, motion information in the frequency domain is able to provide better motion information in the case of illumination change; however, it is sensitive to noise and it suffers from a localization problem [6].
Furthermore, visual saliency is able to predict human fixations in a frame accurately; however, it may highlight a non-interesting area where significant spatial contrast exists. Therefore, we propose a novel method combining all these human visual sensitive features. In addition, a machine learning-based classifier, the support vector machine (SVM), is also used to classify key frames using the proposed features. The results indicate that SVM-based key frame selection does not always provide a very good fusion of the features. As a result, a new adaptive fusion scheme is proposed for combining these features.

3. THE PROPOSED METHOD

The proposed scheme is based on the area of foreground objects, their motion information in the spatial and frequency domains, and visual saliency difference information. The main steps of the proposed method are (A) foreground object extraction, (B) motion information calculation in the spatial domain, (C) motion estimation in the frequency domain, (D) visual saliency difference calculation, (E) adaptive linear weighted fusion of these features, and (F) video summary generation with flexible length. The flow chart of the proposed method is shown in Fig. 1, and each step is explained in the subsequent subsections.

Fig. 1. The conceptual framework of the proposed summarization method: multi-modal features are extracted from the video frames, fused either with weights learned from a user-provided summary or with the default weights, and the resulting ranking of frames is cut according to the user-provided or the system's default skimming ratio to select key frames for the summarized video.

A. Foreground Object Extraction

Foreground objects are the most informative parts of a video stream as they contain more detailed information and play a major role in important events [1]. In order to obtain the foreground object information in a video frame, Gaussian mixture-based DBM [4][33] is applied. In this DBM, each pixel is modeled by K Gaussian distributions (K = 3), and each Gaussian model represents either the background or different foreground objects over time, i.e., across frames. For instance, suppose a pixel intensity x_t at time t is modeled by the k-th Gaussian with recent value γ_k^t, mean ε_k^t, standard deviation σ_k^t, and weight ρ_k^t such that Σ_k ρ_k^t = 1. The learning parameter α is used to update parameter values such as the mean and standard deviation. At the beginning, the system contains an empty set of Gaussian models.

Fig. 2. A representation of the features applied to the bl-18 video; the first, second, third, fourth, and fifth rows represent the area of foreground objects, spatial motion information, frequency-domain motion information, visual saliency difference, and the adaptive fusion of all features, respectively. The red and black lines represent ground truth key frames and the threshold value for video summarization, respectively.

After observing the first pixel (t = 1), a new Gaussian model (k = 1) is generated with γ_k^t = ε_k^t = x_t, standard deviation σ_k^t = 30, and an initial weight ρ_k^t. Then, for each new observation of the pixel intensity x_t at the same location at time t, the method tries to find a matched model among the existing models such that |x_t − ε_k| ≤ 2.5σ_k. If a matched model is found, its parameters are updated as in [4][33]; otherwise, a new Gaussian model is introduced in the same way as for the first pixel. Interested readers may find the detailed explanation of Gaussian mixture-based DBM in [4][33]. In the proposed method, each coloured video frame is converted into a gray scale image I(t) and DBM [4][33] is applied to obtain its corresponding gray scale background frame B(t). To obtain the foreground objects in a frame, the difference between I(t) and B(t) is calculated. In this way, a foreground pixel U_{i,j}(t) is obtained as follows:

U_{i,j}(t) = |I_{i,j}(t) − B_{i,j}(t)|   (1)

where (i, j) is the pixel position. After that, the summation of U_{i,j}(t) is used as the area-of-foreground feature Γ(t), which is obtained by the following equation:

Γ(t) = Σ_{i=1}^{b} Σ_{j=1}^{c} U_{i,j}(t)   (2)

where b and c represent the number of rows and columns of U, respectively. Fig. 2 shows a demonstration of the key frame selection strategy based on the individual features for the bl-18 video, illustrating each feature's strength for video summarization. The first row of Fig. 2 reveals that the size of the foreground based on DBM [4][33] provides good key frame detection; however, the model takes a little time to absorb non-moving objects (which had motion previously) into the background due to the adaptive learning process. As a result, some unnecessary frames, i.e., frames after the movement of an object has stopped, might be selected as key frames if only the area of the foreground is considered. Therefore, considering the foreground object alone is not sufficient to generate a better video summary. To overcome this problem, a motion feature is applied in the proposed scheme in addition to the foreground feature. Again, according to psychological theories of human attention, static attention cues are less informative than motion information [2]. Therefore, motion information in the spatial and frequency domains is included in the proposed method in addition to the foreground object.
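As a rough illustration of step A, the sketch below computes the per-frame foreground-area feature Γ(t) of Eqs. (1)–(2). It is a minimal sketch, assuming OpenCV's MOG2 background subtractor as a stand-in for the Gaussian mixture-based DBM of [4][33] (MOG2 is a related but not identical mixture model); the function name and parameter values are illustrative only.

```python
import cv2
import numpy as np

def foreground_area_feature(video_path):
    """Per-frame foreground-area feature Gamma(t), cf. Eqs. (1)-(2).

    cv2.createBackgroundSubtractorMOG2 is used here as a stand-in for the
    Gaussian mixture-based dynamic background model (DBM) of [4][33].
    """
    cap = cv2.VideoCapture(video_path)
    # A mixture of Gaussians models each pixel, analogous to the DBM.
    bg_model = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
    gamma = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # I(t)
        bg_model.apply(gray)                             # update the pixel-wise model
        background = bg_model.getBackgroundImage()       # B(t)
        u = cv2.absdiff(gray, background)                # Eq. (1): U(t) = |I(t) - B(t)|
        gamma.append(float(u.sum()))                     # Eq. (2): Gamma(t)
    cap.release()
    return np.array(gamma)
```

The per-frame sums returned by such a routine correspond, before normalization, to the kind of curve shown in the first row of Fig. 2.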

B. Motion Information Calculation in Spatial Domain

Human beings usually give more attention to the moving objects in a video [2] in order to understand an event. To obtain object motion information in the spatial domain, the CFD is computed from two consecutive colour frames F(t−1) and F(t) at times t−1 and t, respectively. To find the spatial motion information, the absolute colour difference between these frames is calculated. Therefore, the spatial motion information S_{i,j}(t) at pixel (i, j) and time t can be obtained by the following equation:

S_{i,j}(t) = |F_{i,j}(t) − F_{i,j}(t−1)|   (3)

where (i, j) is the pixel position. The spatial motion feature Υ(t) at time t is obtained by summing all values of S_{i,j}(t) as follows:

Υ(t) = Σ_{i=1}^{b} Σ_{j=1}^{c} S_{i,j}(t)   (4)

where b and c represent the number of rows and columns of S, respectively. In Fig. 2, the motion feature Υ(t) is shown in the second row. It is obvious that by combining the foreground feature Γ(t) and the motion feature Υ(t), unnecessary frames obtained by the foreground feature alone can be removed (see the first and second rows of Fig. 2). Again, motion alone is not a very good feature, because a tiny foreground object with significant motion is less attractive and less informative than a large foreground object with sufficient motion. Therefore, the combination of foreground and motion information can provide a better result in the key frame selection process. To explain this visually, Fig. 3 shows some frames containing only a small portion of a human head. Although these frames contain sufficient motion information in the human head areas, they are not suitable candidates to be key frames, as they do not have enough foreground area; thus, these frames should not be selected as key frames. Moreover, the CFD is sensitive to illumination changes [5][6] and it provides overestimated motion areas comprising object areas, uncovered background areas, and occluded areas.

Fig. 3. An illustration of a small object with adequate motion information (taken from the bl-3 video, including frames no. 717, 723, and 729). These frames should not be selected as key frames.

C. Motion Information Extraction in Frequency Domain

To overcome the problems of spatial motion information, motion information is also calculated in the frequency domain. Motion estimation in the frequency domain has some advantages over motion estimated in the spatial domain [6]: it is robust to global changes of illumination and to motion estimation near object boundaries. To obtain motion information in the frequency domain, each frame is divided into a number of blocks of 16×16 pixels. Then, the phase correlation technique [7][34] is applied between the current block and the reference block. The phase correlation peak (i.e., the magnitude of the motion accuracy) extracted by the phase correlation method is used as the motion indicator β for that block. The phase difference θ is calculated between the current block and its co-located reference block after applying the Fast Fourier Transform (FFT) to each block, using the following equation:

θ = ifft(e^{j(η_ref − η_cur)})   (5)

where η_ref and η_cur represent the phase of the FFT of the reference and current block, respectively. The maximum phase correlation value is calculated as

θ_max = max(θ)   (6)

and the motion indicator β, representing the amount of movement, is then obtained by

β = 1 − θ_max   (7)

If the value of β for a block is greater than a threshold δ, the block is considered to contain sufficient motion information. In the proposed method, a fixed value of δ is used. All the values greater than δ are summed to obtain the motion information Φ(t) in the frequency domain:

Φ(t) = Σ_{l=1}^{L} Σ_{m=1}^{M} β_{l,m}(t), summed over all blocks (l, m) with β_{l,m}(t) > δ   (8)

where L and M represent the number of rows and columns of 16×16 blocks in the gray scale image I, respectively.
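As a concrete illustration of step C, the following minimal sketch computes the frequency-domain motion feature of Eqs. (5)–(8) for one pair of gray scale frames using NumPy, under the reconstruction of Eq. (7) given above. The threshold value `delta` and the function name are assumptions for illustration; the paper's exact threshold δ is not reproduced here.

```python
import numpy as np

def frequency_motion_feature(cur_gray, ref_gray, block=16, delta=0.15):
    """Frequency-domain motion feature Phi(t) for one frame pair, cf. Eqs. (5)-(8).

    `delta` is a hypothetical threshold used only for illustration.
    """
    h, w = cur_gray.shape
    phi = 0.0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            ref_blk = ref_gray[y:y + block, x:x + block].astype(np.float64)
            cur_blk = cur_gray[y:y + block, x:x + block].astype(np.float64)
            eta_ref = np.angle(np.fft.fft2(ref_blk))
            eta_cur = np.angle(np.fft.fft2(cur_blk))
            # Eq. (5): inverse FFT of the pure phase difference
            theta = np.fft.ifft2(np.exp(1j * (eta_ref - eta_cur)))
            theta_max = np.max(np.abs(theta))   # Eq. (6): correlation peak magnitude
            beta = 1.0 - theta_max              # Eq. (7): motion indicator
            if beta > delta:                    # Eq. (8): accumulate moving blocks
                phi += beta
    return phi
```

Applying this function to every consecutive frame pair yields the raw per-frame curve of the kind shown in Fig. 5, which is then smoothed as described next.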
The phase differences calculated in the frequency domain by applying the phase correlation technique to different blocks of a frame of the bl-14 video are shown in Fig. 4. No motion is represented in block (4, 4), which has only a single peak almost equal to 1 (Fig. 4(d)). A single motion (block (8, 9)) with a peak value equal to 0.8 and complex motion (block (8, 3)) with values less than 0.2 are shown in Figs. 4(e) and 4(f), respectively. The magnitude of the peak value varies inversely with motion. The figure reveals that if a block has little or no motion, the magnitude of the peak is close to one; if a block has complex motion (which cannot be represented by a single translational motion using phase correlation), the magnitude is close to zero; and if a block has motion that can be represented by a single translational motion, the magnitude is around 0.5. Thus, the block-wise magnitude obtained by phase correlation can be a good feature for video summarization in the case of illumination changes and for capturing local motion. The obtained magnitude values change abruptly, as shown in Fig. 5. Therefore, we apply Savitzky-Golay filtering [35] with window size ω (see the values in Table 1) to smooth the data; the main advantage of this filter is that it preserves local maxima [35]. After smoothing the data, the motion information Φ(t) in the frequency domain is obtained. In Fig. 2, the third row represents the motion information in the frequency domain. It is clear from the curve that it represents the motion information more accurately than the spatial motion information, because it is robust to global changes of illumination and to motion estimation near object boundaries. However, it also generates some extra motion information due to noise, and the frequency-based approach suffers from a localization problem [6]. Therefore, motion information is calculated in both the spatial and frequency domains for generating the video summary.
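A short sketch of the smoothing step: SciPy's savgol_filter can be applied to the raw per-frame values. The odd window length 101 approximates the ω = 100 used for the bl-14 video in Fig. 5 (savgol_filter expects an odd window), and polyorder=3 is an assumed setting, not one reported in the paper.

```python
import numpy as np
from scipy.signal import savgol_filter

# phi_raw: raw per-frame frequency-domain motion values Phi(1..Y),
# e.g. obtained with frequency_motion_feature() above.
phi_raw = np.random.rand(12900)          # placeholder signal for illustration only
phi_smooth = savgol_filter(phi_raw, window_length=101, polyorder=3)
```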

Fig. 4. An example of the motion generated in each block of a frame of the bl-14 video; (a) frame no. 3868, (b) the frame difference between frames 3868 and 3869 (multiplied by 6 for better visualization), (c) frame no. 3869; the phase correlation peak with no motion, single motion, and complex motion is represented in (d), (e), and (f), respectively.

Fig. 5. A representation of the frequency-domain motion information obtained by phase correlation and smoothed by Savitzky-Golay filtering [35] for the bl-14 video with window size 100. The green, blue, and red lines indicate the ground truth frames and the raw and smoothed frequency-domain motion information, respectively.

D. Visual Saliency Difference Calculation

The visual attention cue is a significant sensitive feature indicating the level of a user's attention for determining key frames [3]. In order to calculate the visual attention, the GBVS method [8] is applied to each frame of the video stream. The GBVS method uses a graph algorithm because of its computational power, topographical structure, and parallel nature. A fully connected directed graph is obtained by connecting the nodes of a feature map. The weights of the directed edges are assigned according to the dissimilarity and the proximity of the two nodes they connect. A Markov chain is defined on this directed graph, and the equilibrium distribution of this chain reflects the activation values. Later, to concentrate the activation values on the most salient regions, another weighted graph is constructed from these activation values, where the weight of each edge is assigned based on the activation values of the two nodes it connects. A Markov chain over this graph again yields an equilibrium distribution over the nodes. In this way, the more attractive regions obtain larger saliency values, and a more uniform and informative saliency map is obtained. It is revealed in the proposed method that when any object movement occurs in a frame, the saliency value at almost every point of the saliency map changes. This observation motivates the calculation of the difference of saliency maps between two consecutive frames, because the saliency difference map provides more distinguishable information than a single saliency map. After obtaining the saliency maps, the sum of the saliency difference between two consecutive frames is used as another feature. Suppose that, at times t and t−1, two consecutive colour frames are F(t) and F(t−1) and their corresponding visual saliency maps are V(t) and V(t−1), respectively. The saliency difference H(t) between V(t) and V(t−1) is calculated as follows:

H_{i,j}(t) = |V_{i,j}(t) − V_{i,j}(t−1)|   (9)

where (i, j) is the pixel position. Following that, the visual saliency difference feature Λ(t) is obtained by the following equation:

Λ(t) = Σ_{i=1}^{b} Σ_{j=1}^{c} H_{i,j}(t)   (10)

where b and c represent the number of rows and columns of H, respectively. The visual saliency difference for the bl-18 video is shown in the fourth row of Fig. 2. It is easily visible that there are some foreground objects (first row) and a small amount of motion information (second row) around a frame range that is not part of the ground truth. In this case, adding the saliency difference feature to the other features (foreground and motion) provides a better result, as shown in the fourth row of Fig. 2. The difference of the saliency maps of two consecutive frames keeps the relative information between them. However, the visual saliency difference provides neither foreground information nor accurate motion information.
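To illustrate step D, the sketch below computes the saliency difference feature Λ(t) of Eqs. (9)–(10). Since GBVS [8] has no standard OpenCV implementation, OpenCV's spectral-residual static saliency (available in the opencv-contrib build) is used here purely as a stand-in saliency model; it is not the saliency method used in the paper.

```python
import cv2
import numpy as np

def saliency_difference_feature(frames):
    """Visual saliency difference feature Lambda(t), cf. Eqs. (9)-(10).

    Spectral-residual saliency (opencv-contrib) is a stand-in for GBVS [8].
    """
    saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
    lam = [0.0]          # the first frame has no predecessor
    prev_map = None
    for frame in frames:
        ok, sal_map = saliency.computeSaliency(frame)    # V(t), values in [0, 1]
        sal_map = sal_map.astype(np.float64)
        if prev_map is not None:
            h = np.abs(sal_map - prev_map)               # Eq. (9): H(t) = |V(t) - V(t-1)|
            lam.append(float(h.sum()))                    # Eq. (10): Lambda(t)
        prev_map = sal_map
    return np.array(lam)
```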
E. Adaptive Linear Weighted Fusion

In this approach, a novel adaptive linear weighted fusion scheme is proposed to combine the features and rank each frame according to its representativeness in the video. Before applying the fusion, each feature is converted into a z-score Z(t) using the following equation:

Z(t) = (X(t) − µ) / ϱ   (11)

where X(t) is a feature value at time t, µ is the mean, and ϱ is the standard deviation of the feature values; the z-score Z(t) is a normalized form of X(t). In this scheme, z-score normalization is the preferred method because it produces meaningful information about each data point and provides better results in the presence of outliers than min-max normalization [36]. The weighted linear fusion is obtained as follows:

R(t) = w_1 Z_Γ(t) + w_2 Z_Υ(t) + w_3 Z_Φ(t) + w_4 Z_Λ(t)   (12)

where R(t) is the fusion value; Z_Γ(t), Z_Υ(t), Z_Φ(t), and Z_Λ(t) are the z-score normalizations of the foreground feature Γ(t), the spatial motion feature Υ(t), the frequency-domain motion feature Φ(t), and the visual saliency difference Λ(t), respectively, at time t; and w_1, w_2, w_3, and w_4 are the weights assigned to Z_Γ(t), Z_Υ(t), Z_Φ(t), and Z_Λ(t), respectively. The weight values are obtained through a learning process. In the learning step, a small video segment containing important event(s) and unnecessary frames is used; frames within the important event(s) are labelled as key frames and the remaining frames are considered non-key frames. Each weight is assigned a value between 0 and 100, and the four weights (w_1, w_2, w_3, and w_4) are chosen such that their sum equals 100. Different combinations of weight values are applied in Eq. (12). For each combination, the fusion values are calculated and sorted in descending order, and as many frames are selected from the top as there are key frames in the training segment. The selected frames are then matched with the key frames, and the combination is given a score based on the similarity between the selected frames and the key frames. Finally, the set of weight values with the highest score among all combinations is selected and used for the entire video. After that, the fusion values R(1), R(2), R(3), ..., R(Y) (where Y is the total number of frames in the video) are calculated using Eq. (12) with the weight values obtained during the learning phase and sorted in descending order. In Fig. 2, the last row presents the fusion values for the bl-18 video of the BL-7F dataset [1]. The proposed fusion method combines all the features in such a way that it successfully suppresses the unnecessary frames and highlights the most informative frames.

F. Video Summary Generation with User Preferences

In the final step, a set of key frames is selected from the sorted fusion values. The proposed approach allows the user to select the skimming ratio λ for the summarized video; otherwise, the default value of λ is used. From the sorted fusion values R(1), R(2), R(3), ..., R(Y), video frames are selected from the top based on λ. Finally, the summarized video is produced from these selected frames, keeping their sequential order in the original video. In Fig. 2, the black line in the last row indicates the threshold value corresponding to λ. From this figure, it is easily seen that the proposed method provides results consistent with the ground truth.
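A minimal sketch of the weight-learning step described above: all weight combinations on a coarse grid summing to 100 are tried on the training segment, each combination is scored by how many of the labelled key frames appear among the top-ranked frames, and the best combination is kept. The grid step and the function signature are assumptions introduced for illustration.

```python
import itertools
import numpy as np

def learn_weights(z_features, key_frame_idx, step=5):
    """Grid-search the fusion weights of Eq. (12) on a short training segment.

    z_features: array of shape (4, Y) with the z-scored features
    (foreground, spatial motion, frequency motion, saliency difference).
    key_frame_idx: indices labelled as key frames in the training segment.
    `step` is an assumed grid resolution, not a value from the paper.
    """
    key_set = set(key_frame_idx)
    n_keys = len(key_set)
    best_score, best_w = -1, None
    grid = range(0, 101, step)
    for w1, w2, w3 in itertools.product(grid, repeat=3):
        w4 = 100 - w1 - w2 - w3              # the four weights must sum to 100
        if w4 < 0:
            continue
        r = np.dot([w1, w2, w3, w4], z_features)   # Eq. (12): fusion values R(t)
        top = np.argsort(r)[::-1][:n_keys]         # take as many frames as there are key frames
        score = len(key_set.intersection(top.tolist()))
        if score > best_score:
            best_score, best_w = score, (w1, w2, w3, w4)
    return best_w
```

Once the weights are learned, the summary follows the same ranking rule: R(t) is computed for the whole video with the learned weights, sorted in descending order, and the top λ·Y frames are kept in their original temporal order.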

4. RESULTS AND DISCUSSION

The proposed method is evaluated on the publicly available BL-7F dataset [1] and the Office and Office Lobby datasets [37]. In the BL-7F dataset, 19 surveillance videos are taken from fixed surveillance cameras located on the seventh floor of the Barry Lam Building in National Taiwan University. The duration of each video is 7 minutes 10 seconds, and each contains 12,900 frames. This dataset also provides a complete list of selected key frames as the ground truth for each video. In the Office dataset [37], four videos are collected with stably held but non-fixed cameras; the main difficulties are camera vibration and different lighting conditions. Similarly, three videos are collected in the Office Lobby dataset [37] with stably held but non-fixed cameras; however, they contain crowded scenes with richer activities compared to the Office dataset. The ground truth key frames for both the Office and Office Lobby datasets are also publicly available.

Fig. 6. An example of foreground object extraction. (a) Frame no. 820 (gray scale) of the bl-1 video, (b) the background frame of (a), and (c) the foreground objects.

In Fig. 6, an illustration of the foreground objects extracted by the proposed method is shown. Frame no. 820 of the bl-1 video of the BL-7F dataset is selected to represent the foreground area. This frame is converted into a gray scale image and shown in Fig. 6(a). The corresponding gray scale background image of frame 820 obtained by the DBM-based method is shown in Fig. 6(b). The foreground objects of frame 820 after applying Eq. (1) are shown in Fig. 6(c). Fig. 6 demonstrates that the proposed technique is capable of extracting the foreground region using DBM [4] for selecting important event information.

Fig. 7. An illustration of frame-to-frame motion information estimation; (a) and (b) are frames no. 820 and 819 of the bl-1 video, respectively, and (c) is the object motion between frames no. 820 and 819.

In Fig. 7, the CFD for spatial motion information extracted by the proposed method is presented. Two consecutive frames (820 and 819) of the bl-1 video from the BL-7F dataset (Figs. 7(a) and 7(b)) are selected for this purpose. The object motion information between these two consecutive frames obtained using Eq. (3) is shown in Fig. 7(c). Fig. 7 confirms that the proposed method is very competent at estimating spatial motion information.

In Fig. 8, an example of the visual saliency difference calculation is displayed. Figures 8(a) and 8(d) show frames 820 and 819, respectively, of the bl-1 video of the BL-7F dataset. Figures 8(b) and 8(e) are the corresponding saliency maps obtained by the GBVS algorithm [8]. The saliency maps overlaid on Figs. 8(a) and 8(d) are shown in Figs. 8(c) and 8(f), respectively. It is easily visible that the most salient regions in Figs. 8(c) and 8(f) are not very attractive; this information does not provide an accurate indication for selecting key frames. To overcome this problem, the difference of the two consecutive saliency maps is calculated and used as one of the features. Figure 8(g) shows the difference between the saliency maps of Figs. 8(b) and 8(e). For better visualization, Fig. 8(g) is multiplied by 5 and displayed in Fig. 8(h).

Fig. 8. A representation of the visual saliency difference calculation; (a) and (d) are two consecutive frames (820 and 819) of the bl-1 video, (b) and (e) are the saliency maps of (a) and (d) obtained by [8], (c) and (f) are the saliency maps overlaid on (a) and (d), (g) is the saliency map difference between (b) and (e), (h) is (g) multiplied by 5 for clear visualization, and (i) is the saliency map difference overlaid on (a).
The saliency map difference of Fig. 8(g) overlaid on Fig. 8(a) is shown in Fig. 8(i). It is observed from Fig. 8(i) that the difference of the visual saliency maps represents the salient region more accurately. The motion information in the frequency domain extracted by the phase correlation technique is shown in Fig. 9. Frames no. 740 and 741 of the bl-0 video are shown in Figs. 9(a) and 9(b), respectively, and the motion information extracted by the phase correlation technique is presented in Fig. 9(c).

Fig. 9. An example of motion information extracted by the phase correlation method; (a) and (b) are frames no. 820 and 819 of the bl-1 video, and (c) is the motion obtained by the phase correlation technique.

To evaluate the proposed method, an objective comparison has been performed. For this purpose, a set of evaluation metrics including precision, recall, and F-measure is computed.

Fig. 10. F-measures of the area of foreground objects (F_foreground), spatial motion (F_spatial), saliency difference (F_saliency), and frequency-domain motion (F_frequency) features, and of the proposed fusion (F_proposed), for the BL-7F, Lobby, and Office datasets.

The definitions of precision and recall are as follows:

Precision = t_p / (t_p + f_p)   (13)

Recall = t_p / (t_p + f_n)   (14)

where t_p is the number of frames selected by both a method and the ground truth, f_p is the number of frames selected by a method but not by the ground truth, and f_n is the number of frames selected by the ground truth but not by a method. However, neither precision nor recall alone provides a good indication of the quality of a summary; for example, a method can offer high precision but poor recall, or vice versa. To be efficient and robust, a method must achieve both high precision and high recall. To represent this, the F-measure is defined by combining precision and recall as follows:

F1-measure = 2 × Precision × Recall / (Precision + Recall)   (15)

High values of both precision and recall yield a high F-measure; thus, a method with a high F-measure value indicates a better summarization technique.
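For reference, a direct implementation of Eqs. (13)–(15) on sets of selected and ground-truth frame indices might look as follows; the function name is illustrative.

```python
def summarization_scores(selected, ground_truth):
    """Precision, recall and F-measure of Eqs. (13)-(15) for a frame selection."""
    selected, ground_truth = set(selected), set(ground_truth)
    tp = len(selected & ground_truth)     # selected by both the method and the ground truth
    fp = len(selected - ground_truth)     # selected by the method only
    fn = len(ground_truth - selected)     # selected by the ground truth only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```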
The F-measures of the proposed method using only foreground objects (F_foreground), only spatial motion (F_spatial), only the saliency difference (F_saliency), only motion in the frequency domain (F_frequency), and the combination of all features are shown in Fig. 10. Examining this graph, it is evident that the method using only foreground objects performs better than the other individual features in bl-7, bl-8, bl-9, bl-10, bl-11, bl-15, bl-16, lobby-1, lobby-2, and office-3. If only spatial motion is considered, it performs best in the bl-0, bl-2, bl-4, bl-12, bl-14, bl-15, bl-16, bl-17, lobby-0, and lobby-1 videos. Again, the method applying only the phase correlation technique outperforms the other features in the bl-1, bl-3, bl-5, bl-6, bl-18, office-0, office-1, office-2, and office-3 videos; in the case of illumination changes it performs better than the other features. The saliency difference feature performs better than the frequency-domain motion feature in lobby-1 and lobby-2, and better than the area-of-foreground feature in the office-2 video. The proposed approach therefore combines all these features and performs better than the GMM-based method in all videos of the BL-7F, Office, and Lobby datasets.

The proposed approach is compared with the single-view video summarization results provided by the GMM-based method [1], the saliency directed prioritization (SDP) method [28], and the summarization in compressed domain (SCD) method [32]. These are the most relevant state-of-the-art methods for summarizing surveillance video. As the proposed method applies a Gaussian mixture model, we compare it with another GMM-based method. There are key differences between the GMM-based method [1] and the proposed method. Firstly, the GMM-based method works at the frame level, whereas the Gaussian mixture-based DBM [33] applied in the proposed method works at the pixel level. Secondly, the GMM-based method utilizes a color descriptor as its feature, while the proposed method uses human visual sensitive features, namely foreground objects, motion information in the spatial and frequency domains, and the visual saliency map difference. Since the proposed method employs the saliency map difference as a feature, we compare it with the recently proposed SDP method, which also applies a saliency-based technique. The proposed method applies the phase correlation (PC) technique and the SCD method implements the discrete cosine transform (DCT); as both PC and DCT work in the frequency domain, the proposed method is also compared with SCD.

In Fig. 11, a number of ground truth frames of the bl-11 video of the BL-7F dataset [1] and the results obtained by the GMM-based method as well as the proposed method are shown. Although there is significant content in frames 9963 and 12523, the GMM-based method fails to select these frames. In contrast, the proposed method is capable of selecting these frames successfully. The main reason for this success is that the proposed method combines the area of foreground objects, the visual saliency difference, and the frequency- and spatial-domain motion information.

In this proposed method, the user-preferred skimming ratio λ is set to the total number of ground truth key frames for each video. It is found that the introduced scheme generates more accurate results if λ + 2% of λ frames are selected from the ranked, sorted list of R(1), R(2), R(3), ..., R(Y), where Y is the total number of frames in a video. The default value of λ is set to 20% of the total number of frames of a video; this skimming ratio is also consistent with some other existing methods [38][39]. In Fig. 12, the skimming ratio of the ground truth key frames over the total number of frames and the default skimming ratio (20% of the total video frames) for the BL-7F, Lobby, and Office datasets [1][37] are shown. It is clear from the graph that the default skimming ratio is almost consistent with the ground truth skimming ratio provided in [1][37].

The values of the different weights (w_1, w_2, w_3, w_4) and the window size ω used in the proposed method, obtained by the adaptive fusion method, are shown in Table 1. The table reveals that the values of the weights w_1, w_2, w_3, and w_4 vary from 5% to 85% depending on the nature of the video. The average weights of the foreground size and the motion features are larger than the weight of the saliency feature. We observe this because, for a video, motion and the amount of foreground are the two most prominent human visual features compared to the saliency variation within frames.

Fig. 11. Evaluation of key frame extraction for the bl-11 video of the BL-7F dataset; the first, second, third, and fourth columns indicate the frame number, the ground truth, and the results obtained by the GMM-based method [1] and the proposed method, respectively. Two of the ground truth frames are not selected by the GMM-based method.

The results for precision, recall, and F-measure of the proposed method, the GMM-based method (intra-view) [1], the SDP-based method (intra-view) [28], and the SCD-based method [32] are shown in Table 2. It is observed from Table 2 that the mean F1-measure for the BL-7F dataset obtained by the proposed method is 92.5, whereas those achieved by the GMM-based, SDP-based, and SCD-based methods are 66.6, 79.8, and 41.6, respectively. For the Lobby dataset, the mean F1-measure of the proposed method is 83.0, which is higher than those of the GMM-based (80.0), SDP-based (74.0), and SCD-based (48.1) methods. In the case of the Office dataset, the highest mean F1-measure is also obtained by the proposed method; the means of the F1-measure obtained in the Office dataset by the GMM-based, SDP-based, and SCD-based methods are 53.0, 61.7, and 21.5, respectively. The proposed method achieves a higher F1-measure than the existing and relevant methods for all videos of the BL-7F, Office, and Lobby datasets except the bl-12 video of the BL-7F dataset. Table 2 also indicates that the proposed method not only performs with higher accuracy, but the variance of its performance across the different videos of the BL-7F, Office, and Lobby datasets is also more consistent compared to the existing and relevant state-of-the-art methods.

The F-measures of the proposed method with the adaptive weight scheme, the average-weight approach, the default skimming ratio, and SVM, along with the GMM-based approach [1], are shown in Fig. 13. In the proposed method, the well-known SVM library LIBSVM [40] is applied to train a model, which is then applied to a set of test images to obtain key frames. We employ the radial basis function (RBF) as the kernel function, as it maps the features non-linearly into a high-dimensional space so that non-linear relationships between class labels and attributes can be handled [40]. From this graph, it is observed that the proposed method with adaptive weights performs better than the recently proposed state-of-the-art GMM-based approach [1] in all videos of the BL-7F, Office, and Lobby datasets except the bl-12 video of the BL-7F dataset, and better than the other approaches (the proposed method with the average weight scheme, the default skimming ratio, and the SVM-based scheme). The proposed method with average weights achieves results similar to the adaptive-weight version in the lobby-0 and lobby-2 videos of the Lobby dataset; however, it performs worse in the bl-2, bl-12, and bl-15 videos. The proposed method with the default skimming ratio performs almost the same as the GMM-based method for the bl-0, bl-2, bl-6, bl-14, and bl-17 videos of the BL-7F dataset [1] and the office-0, office-1, and office-2 videos of the Office dataset [37]. However, it performs worse in the bl-11 and bl-12 videos of the BL-7F dataset [1] and the lobby-0, lobby-1, and lobby-2 videos of the Office Lobby dataset [37]. The main reason for the worse performance on these videos is that the default skimming ratio (20%) is much smaller than the ground truth skimming ratio (Fig. 12); therefore, the proposed method with the default skimming ratio misses some key frames. To overcome this problem, we allow the user to select the skimming ratio.
The proposed method with SVM achieves results as good as the proposed method with adaptive weights in the bl-15 video of the BL-7F dataset. In contrast, it shows lower performance in the bl-17 video of the BL-7F dataset, lobby-0, lobby-1, and lobby-2 of the Office Lobby dataset [37], and office-1 and office-2 of the Office dataset [37].

Fig. 12. A comparison of the skimming ratio provided by the ground truth and the default skimming ratio proposed by the system for each video.

Fig. 13. F-measures of the GMM-based method (intra-view) [1] and of the proposed method with adaptive weights, average weights, SVM, and the default skimming ratio.

Table 1. The weight values (w_1, w_2, w_3, w_4) and window sizes ω obtained by the proposed adaptive fusion scheme for each video of the BL-7F (bl-0 to bl-18), Lobby (lobby-0 to lobby-2), and Office (office-0 to office-3) datasets, together with their averages.

The GMM-based approach attains poor performance in the bl-0, bl-1, bl-3, bl-4, bl-5, bl-6, bl-7, bl-8, bl-9, bl-10, bl-11, bl-13, bl-14, bl-16, and bl-18 videos of the BL-7F dataset, and in office-0 and office-3 of the Office dataset [37]. The main reason for the poor performance of the GMM-based method [1] is that it applies only the MPEG-7 color layout descriptor as a feature; it does not consider pixel-wise foreground objects, motion, or human visual saliency information. As the proposed method considers these human visual sensitive features, it outperforms the GMM-based method. The proposed method with the average weight scheme does not perform better than the proposed method with adaptive weights because the contents are not the same for all videos; therefore, fixed weight values tuned for one video do not guarantee a better result for another video. Again, the proposed method with SVM shows inferior performance to the proposed method with adaptive weights. The results of SVM depend on the choice of kernel; moreover, discrete data do not always provide better results with an SVM classifier [41].

However, the GMM-based approach performs best only in the bl-12 video of the BL-7F dataset. After observing the key frames extracted by the proposed method for the bl-12 video, the reasons for this failure have been explored. In the bl-12 video, there are some frames with significant objects and motion that are nevertheless not selected as ground truth frames according to [1]; conversely, although some frames contain no foreground object and/or motion, they are considered ground truth. For example, frames no. 4083, 4120, and 4563 contain a sufficient amount of object and motion, as shown in the first row of Fig. 14; in these frames it is clearly visible that a person is working near the door. However, these frames are not selected as ground truth (key frames). On the other hand, no object or significant motion exists in the frames shown in the second row of Fig. 14; nonetheless, they are selected as key frames (ground truth). No explanation of this is found in [1].

Fig. 14. Sample frames of bl-12 that are not selected as ground truth (first row) and that are considered key frames (second row).

5. CONCLUSION

In this paper, an effective and robust framework is proposed to summarize surveillance video by combining human visual sensitive features, namely the area of foreground objects, motion information in the spatial and frequency domains, and the visual saliency difference of adjacent frames. According to [1], foreground objects usually contain detailed information about the video content. Moreover, human beings naturally give more attention to object motion in a video [2]. Furthermore, the visual attention cue is a sensitive feature indicating the level of a user's attention for determining key frames [3].


Dynamic visual attention: competitive versus motion priority scheme

Dynamic visual attention: competitive versus motion priority scheme Dynamic visual attention: competitive versus motion priority scheme Bur A. 1, Wurtz P. 2, Müri R.M. 2 and Hügli H. 1 1 Institute of Microtechnology, University of Neuchâtel, Neuchâtel, Switzerland 2 Perception

More information

Motion in 2D image sequences

Motion in 2D image sequences Motion in 2D image sequences Definitely used in human vision Object detection and tracking Navigation and obstacle avoidance Analysis of actions or activities Segmentation and understanding of video sequences

More information

A new predictive image compression scheme using histogram analysis and pattern matching

A new predictive image compression scheme using histogram analysis and pattern matching University of Wollongong Research Online University of Wollongong in Dubai - Papers University of Wollongong in Dubai 00 A new predictive image compression scheme using histogram analysis and pattern matching

More information

Classification of objects from Video Data (Group 30)

Classification of objects from Video Data (Group 30) Classification of objects from Video Data (Group 30) Sheallika Singh 12665 Vibhuti Mahajan 12792 Aahitagni Mukherjee 12001 M Arvind 12385 1 Motivation Video surveillance has been employed for a long time

More information

Robotics Programming Laboratory

Robotics Programming Laboratory Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car

More information

Multi-Camera Calibration, Object Tracking and Query Generation

Multi-Camera Calibration, Object Tracking and Query Generation MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Multi-Camera Calibration, Object Tracking and Query Generation Porikli, F.; Divakaran, A. TR2003-100 August 2003 Abstract An automatic object

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Dense Image-based Motion Estimation Algorithms & Optical Flow

Dense Image-based Motion Estimation Algorithms & Optical Flow Dense mage-based Motion Estimation Algorithms & Optical Flow Video A video is a sequence of frames captured at different times The video data is a function of v time (t) v space (x,y) ntroduction to motion

More information

SEMI-BLIND IMAGE RESTORATION USING A LOCAL NEURAL APPROACH

SEMI-BLIND IMAGE RESTORATION USING A LOCAL NEURAL APPROACH SEMI-BLIND IMAGE RESTORATION USING A LOCAL NEURAL APPROACH Ignazio Gallo, Elisabetta Binaghi and Mario Raspanti Universitá degli Studi dell Insubria Varese, Italy email: ignazio.gallo@uninsubria.it ABSTRACT

More information

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions Edith Cowan University Research Online ECU Publications Pre. JPEG compression of monochrome D-barcode images using DCT coefficient distributions Keng Teong Tan Hong Kong Baptist University Douglas Chai

More information

Evaluation of Moving Object Tracking Techniques for Video Surveillance Applications

Evaluation of Moving Object Tracking Techniques for Video Surveillance Applications International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Evaluation

More information

Adaptive Background Mixture Models for Real-Time Tracking

Adaptive Background Mixture Models for Real-Time Tracking Adaptive Background Mixture Models for Real-Time Tracking Chris Stauffer and W.E.L Grimson CVPR 1998 Brendan Morris http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Motivation Video monitoring and surveillance

More information

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging 1 CS 9 Final Project Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging Feiyu Chen Department of Electrical Engineering ABSTRACT Subject motion is a significant

More information

CS 231A Computer Vision (Fall 2012) Problem Set 3

CS 231A Computer Vision (Fall 2012) Problem Set 3 CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest

More information

Short Run length Descriptor for Image Retrieval

Short Run length Descriptor for Image Retrieval CHAPTER -6 Short Run length Descriptor for Image Retrieval 6.1 Introduction In the recent years, growth of multimedia information from various sources has increased many folds. This has created the demand

More information

Learning a Manifold as an Atlas Supplementary Material

Learning a Manifold as an Atlas Supplementary Material Learning a Manifold as an Atlas Supplementary Material Nikolaos Pitelis Chris Russell School of EECS, Queen Mary, University of London [nikolaos.pitelis,chrisr,lourdes]@eecs.qmul.ac.uk Lourdes Agapito

More information

Motion Estimation and Optical Flow Tracking

Motion Estimation and Optical Flow Tracking Image Matching Image Retrieval Object Recognition Motion Estimation and Optical Flow Tracking Example: Mosiacing (Panorama) M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003 Example 3D Reconstruction

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

CHAPTER 9. Classification Scheme Using Modified Photometric. Stereo and 2D Spectra Comparison

CHAPTER 9. Classification Scheme Using Modified Photometric. Stereo and 2D Spectra Comparison CHAPTER 9 Classification Scheme Using Modified Photometric Stereo and 2D Spectra Comparison 9.1. Introduction In Chapter 8, even we combine more feature spaces and more feature generators, we note that

More information

C. Premsai 1, Prof. A. Kavya 2 School of Computer Science, School of Computer Science Engineering, Engineering VIT Chennai, VIT Chennai

C. Premsai 1, Prof. A. Kavya 2 School of Computer Science, School of Computer Science Engineering, Engineering VIT Chennai, VIT Chennai Traffic Sign Detection Via Graph-Based Ranking and Segmentation Algorithm C. Premsai 1, Prof. A. Kavya 2 School of Computer Science, School of Computer Science Engineering, Engineering VIT Chennai, VIT

More information

International Journal of Advance Engineering and Research Development

International Journal of Advance Engineering and Research Development Scientific Journal of Impact Factor (SJIF): 4.72 International Journal of Advance Engineering and Research Development Volume 4, Issue 11, November -2017 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Comparative

More information

MOVING OBJECT DETECTION USING BACKGROUND SUBTRACTION ALGORITHM USING SIMULINK

MOVING OBJECT DETECTION USING BACKGROUND SUBTRACTION ALGORITHM USING SIMULINK MOVING OBJECT DETECTION USING BACKGROUND SUBTRACTION ALGORITHM USING SIMULINK Mahamuni P. D 1, R. P. Patil 2, H.S. Thakar 3 1 PG Student, E & TC Department, SKNCOE, Vadgaon Bk, Pune, India 2 Asst. Professor,

More information

Bus Detection and recognition for visually impaired people

Bus Detection and recognition for visually impaired people Bus Detection and recognition for visually impaired people Hangrong Pan, Chucai Yi, and Yingli Tian The City College of New York The Graduate Center The City University of New York MAP4VIP Outline Motivation

More information

Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging

Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging Florin C. Ghesu 1, Thomas Köhler 1,2, Sven Haase 1, Joachim Hornegger 1,2 04.09.2014 1 Pattern

More information

Digital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering

Digital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering Digital Image Processing Prof. P.K. Biswas Department of Electronics & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Image Segmentation - III Lecture - 31 Hello, welcome

More information

A Feature Point Matching Based Approach for Video Objects Segmentation

A Feature Point Matching Based Approach for Video Objects Segmentation A Feature Point Matching Based Approach for Video Objects Segmentation Yan Zhang, Zhong Zhou, Wei Wu State Key Laboratory of Virtual Reality Technology and Systems, Beijing, P.R. China School of Computer

More information

Text Extraction in Video

Text Extraction in Video International Journal of Computational Engineering Research Vol, 03 Issue, 5 Text Extraction in Video 1, Ankur Srivastava, 2, Dhananjay Kumar, 3, Om Prakash Gupta, 4, Amit Maurya, 5, Mr.sanjay kumar Srivastava

More information

The SIFT (Scale Invariant Feature

The SIFT (Scale Invariant Feature The SIFT (Scale Invariant Feature Transform) Detector and Descriptor developed by David Lowe University of British Columbia Initial paper ICCV 1999 Newer journal paper IJCV 2004 Review: Matt Brown s Canonical

More information

MULTIVIEW REPRESENTATION OF 3D OBJECTS OF A SCENE USING VIDEO SEQUENCES

MULTIVIEW REPRESENTATION OF 3D OBJECTS OF A SCENE USING VIDEO SEQUENCES MULTIVIEW REPRESENTATION OF 3D OBJECTS OF A SCENE USING VIDEO SEQUENCES Mehran Yazdi and André Zaccarin CVSL, Dept. of Electrical and Computer Engineering, Laval University Ste-Foy, Québec GK 7P4, Canada

More information

Estimating the wavelength composition of scene illumination from image data is an

Estimating the wavelength composition of scene illumination from image data is an Chapter 3 The Principle and Improvement for AWB in DSC 3.1 Introduction Estimating the wavelength composition of scene illumination from image data is an important topics in color engineering. Solutions

More information

Lecture 6: Multimedia Information Retrieval Dr. Jian Zhang

Lecture 6: Multimedia Information Retrieval Dr. Jian Zhang Lecture 6: Multimedia Information Retrieval Dr. Jian Zhang NICTA & CSE UNSW COMP9314 Advanced Database S1 2007 jzhang@cse.unsw.edu.au Reference Papers and Resources Papers: Colour spaces-perceptual, historical

More information

Salient Region Detection and Segmentation in Images using Dynamic Mode Decomposition

Salient Region Detection and Segmentation in Images using Dynamic Mode Decomposition Salient Region Detection and Segmentation in Images using Dynamic Mode Decomposition Sikha O K 1, Sachin Kumar S 2, K P Soman 2 1 Department of Computer Science 2 Centre for Computational Engineering and

More information

CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM

CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM 1 PHYO THET KHIN, 2 LAI LAI WIN KYI 1,2 Department of Information Technology, Mandalay Technological University The Republic of the Union of Myanmar

More information

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions Akitsugu Noguchi and Keiji Yanai Department of Computer Science, The University of Electro-Communications, 1-5-1 Chofugaoka,

More information

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning

More information

Summarization of Egocentric Moving Videos for Generating Walking Route Guidance

Summarization of Egocentric Moving Videos for Generating Walking Route Guidance Summarization of Egocentric Moving Videos for Generating Walking Route Guidance Masaya Okamoto and Keiji Yanai Department of Informatics, The University of Electro-Communications 1-5-1 Chofugaoka, Chofu-shi,

More information

Detecting and Identifying Moving Objects in Real-Time

Detecting and Identifying Moving Objects in Real-Time Chapter 9 Detecting and Identifying Moving Objects in Real-Time For surveillance applications or for human-computer interaction, the automated real-time tracking of moving objects in images from a stationary

More information

Obtaining Feature Correspondences

Obtaining Feature Correspondences Obtaining Feature Correspondences Neill Campbell May 9, 2008 A state-of-the-art system for finding objects in images has recently been developed by David Lowe. The algorithm is termed the Scale-Invariant

More information

Cellular Learning Automata-Based Color Image Segmentation using Adaptive Chains

Cellular Learning Automata-Based Color Image Segmentation using Adaptive Chains Cellular Learning Automata-Based Color Image Segmentation using Adaptive Chains Ahmad Ali Abin, Mehran Fotouhi, Shohreh Kasaei, Senior Member, IEEE Sharif University of Technology, Tehran, Iran abin@ce.sharif.edu,

More information

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.11, November 2013 1 Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial

More information

Experimentation on the use of Chromaticity Features, Local Binary Pattern and Discrete Cosine Transform in Colour Texture Analysis

Experimentation on the use of Chromaticity Features, Local Binary Pattern and Discrete Cosine Transform in Colour Texture Analysis Experimentation on the use of Chromaticity Features, Local Binary Pattern and Discrete Cosine Transform in Colour Texture Analysis N.Padmapriya, Ovidiu Ghita, and Paul.F.Whelan Vision Systems Laboratory,

More information

DYNAMIC BACKGROUND SUBTRACTION BASED ON SPATIAL EXTENDED CENTER-SYMMETRIC LOCAL BINARY PATTERN. Gengjian Xue, Jun Sun, Li Song

DYNAMIC BACKGROUND SUBTRACTION BASED ON SPATIAL EXTENDED CENTER-SYMMETRIC LOCAL BINARY PATTERN. Gengjian Xue, Jun Sun, Li Song DYNAMIC BACKGROUND SUBTRACTION BASED ON SPATIAL EXTENDED CENTER-SYMMETRIC LOCAL BINARY PATTERN Gengjian Xue, Jun Sun, Li Song Institute of Image Communication and Information Processing, Shanghai Jiao

More information

Background subtraction in people detection framework for RGB-D cameras

Background subtraction in people detection framework for RGB-D cameras Background subtraction in people detection framework for RGB-D cameras Anh-Tuan Nghiem, Francois Bremond INRIA-Sophia Antipolis 2004 Route des Lucioles, 06902 Valbonne, France nghiemtuan@gmail.com, Francois.Bremond@inria.fr

More information

Graph-based High Level Motion Segmentation using Normalized Cuts

Graph-based High Level Motion Segmentation using Normalized Cuts Graph-based High Level Motion Segmentation using Normalized Cuts Sungju Yun, Anjin Park and Keechul Jung Abstract Motion capture devices have been utilized in producing several contents, such as movies

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology

More information

Computationally Efficient Serial Combination of Rotation-invariant and Rotation Compensating Iris Recognition Algorithms

Computationally Efficient Serial Combination of Rotation-invariant and Rotation Compensating Iris Recognition Algorithms Computationally Efficient Serial Combination of Rotation-invariant and Rotation Compensating Iris Recognition Algorithms Andreas Uhl Department of Computer Sciences University of Salzburg, Austria uhl@cosy.sbg.ac.at

More information

Lecture 9: Hough Transform and Thresholding base Segmentation

Lecture 9: Hough Transform and Thresholding base Segmentation #1 Lecture 9: Hough Transform and Thresholding base Segmentation Saad Bedros sbedros@umn.edu Hough Transform Robust method to find a shape in an image Shape can be described in parametric form A voting

More information

Saliency Detection for Videos Using 3D FFT Local Spectra

Saliency Detection for Videos Using 3D FFT Local Spectra Saliency Detection for Videos Using 3D FFT Local Spectra Zhiling Long and Ghassan AlRegib School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA ABSTRACT

More information

CS4733 Class Notes, Computer Vision

CS4733 Class Notes, Computer Vision CS4733 Class Notes, Computer Vision Sources for online computer vision tutorials and demos - http://www.dai.ed.ac.uk/hipr and Computer Vision resources online - http://www.dai.ed.ac.uk/cvonline Vision

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

Automatic Colorization of Grayscale Images

Automatic Colorization of Grayscale Images Automatic Colorization of Grayscale Images Austin Sousa Rasoul Kabirzadeh Patrick Blaes Department of Electrical Engineering, Stanford University 1 Introduction ere exists a wealth of photographic images,

More information

Spatial Outlier Detection

Spatial Outlier Detection Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1 Spatial Outlier A spatial data point

More information

Change detection using joint intensity histogram

Change detection using joint intensity histogram Change detection using joint intensity histogram Yasuyo Kita National Institute of Advanced Industrial Science and Technology (AIST) Information Technology Research Institute AIST Tsukuba Central 2, 1-1-1

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

AN EFFICIENT BATIK IMAGE RETRIEVAL SYSTEM BASED ON COLOR AND TEXTURE FEATURES

AN EFFICIENT BATIK IMAGE RETRIEVAL SYSTEM BASED ON COLOR AND TEXTURE FEATURES AN EFFICIENT BATIK IMAGE RETRIEVAL SYSTEM BASED ON COLOR AND TEXTURE FEATURES 1 RIMA TRI WAHYUNINGRUM, 2 INDAH AGUSTIEN SIRADJUDDIN 1, 2 Department of Informatics Engineering, University of Trunojoyo Madura,

More information

Multidimensional Image Registered Scanner using MDPSO (Multi-objective Discrete Particle Swarm Optimization)

Multidimensional Image Registered Scanner using MDPSO (Multi-objective Discrete Particle Swarm Optimization) Multidimensional Image Registered Scanner using MDPSO (Multi-objective Discrete Particle Swarm Optimization) Rishiganesh V 1, Swaruba P 2 PG Scholar M.Tech-Multimedia Technology, Department of CSE, K.S.R.

More information

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm Group 1: Mina A. Makar Stanford University mamakar@stanford.edu Abstract In this report, we investigate the application of the Scale-Invariant

More information

Last update: May 4, Vision. CMSC 421: Chapter 24. CMSC 421: Chapter 24 1

Last update: May 4, Vision. CMSC 421: Chapter 24. CMSC 421: Chapter 24 1 Last update: May 4, 200 Vision CMSC 42: Chapter 24 CMSC 42: Chapter 24 Outline Perception generally Image formation Early vision 2D D Object recognition CMSC 42: Chapter 24 2 Perception generally Stimulus

More information

Performance Evaluation of Monitoring System Using IP Camera Networks

Performance Evaluation of Monitoring System Using IP Camera Networks 1077 Performance Evaluation of Monitoring System Using IP Camera Networks Maysoon Hashim Ismiaal Department of electronic and communications, faculty of engineering, university of kufa Abstract Today,

More information

Feature extraction. Bi-Histogram Binarization Entropy. What is texture Texture primitives. Filter banks 2D Fourier Transform Wavlet maxima points

Feature extraction. Bi-Histogram Binarization Entropy. What is texture Texture primitives. Filter banks 2D Fourier Transform Wavlet maxima points Feature extraction Bi-Histogram Binarization Entropy What is texture Texture primitives Filter banks 2D Fourier Transform Wavlet maxima points Edge detection Image gradient Mask operators Feature space

More information

Video shot segmentation using late fusion technique

Video shot segmentation using late fusion technique Video shot segmentation using late fusion technique by C. Krishna Mohan, N. Dhananjaya, B.Yegnanarayana in Proc. Seventh International Conference on Machine Learning and Applications, 2008, San Diego,

More information

Color Image Segmentation Using a Spatial K-Means Clustering Algorithm

Color Image Segmentation Using a Spatial K-Means Clustering Algorithm Color Image Segmentation Using a Spatial K-Means Clustering Algorithm Dana Elena Ilea and Paul F. Whelan Vision Systems Group School of Electronic Engineering Dublin City University Dublin 9, Ireland danailea@eeng.dcu.ie

More information

Comparative Study of ROI Extraction of Palmprint

Comparative Study of ROI Extraction of Palmprint 251 Comparative Study of ROI Extraction of Palmprint 1 Milind E. Rane, 2 Umesh S Bhadade 1,2 SSBT COE&T, North Maharashtra University Jalgaon, India Abstract - The Palmprint region segmentation is an important

More information

Image Processing. Image Features

Image Processing. Image Features Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

More information

Figure 1 shows unstructured data when plotted on the co-ordinate axis

Figure 1 shows unstructured data when plotted on the co-ordinate axis 7th International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN) Key Frame Extraction and Foreground Modelling Using K-Means Clustering Azra Nasreen Kaushik Roy Kunal

More information

A Model of Dynamic Visual Attention for Object Tracking in Natural Image Sequences

A Model of Dynamic Visual Attention for Object Tracking in Natural Image Sequences Published in Computational Methods in Neural Modeling. (In: Lecture Notes in Computer Science) 2686, vol. 1, 702-709, 2003 which should be used for any reference to this work 1 A Model of Dynamic Visual

More information

Saliency Extraction for Gaze-Contingent Displays

Saliency Extraction for Gaze-Contingent Displays In: Workshop on Organic Computing, P. Dadam, M. Reichert (eds.), Proceedings of the 34th GI-Jahrestagung, Vol. 2, 646 650, Ulm, September 2004. Saliency Extraction for Gaze-Contingent Displays Martin Böhme,

More information

A Fast Moving Object Detection Technique In Video Surveillance System

A Fast Moving Object Detection Technique In Video Surveillance System A Fast Moving Object Detection Technique In Video Surveillance System Paresh M. Tank, Darshak G. Thakore, Computer Engineering Department, BVM Engineering College, VV Nagar-388120, India. Abstract Nowadays

More information

ROBUST LINE-BASED CALIBRATION OF LENS DISTORTION FROM A SINGLE VIEW

ROBUST LINE-BASED CALIBRATION OF LENS DISTORTION FROM A SINGLE VIEW ROBUST LINE-BASED CALIBRATION OF LENS DISTORTION FROM A SINGLE VIEW Thorsten Thormählen, Hellward Broszio, Ingolf Wassermann thormae@tnt.uni-hannover.de University of Hannover, Information Technology Laboratory,

More information

Segmentation and Grouping

Segmentation and Grouping Segmentation and Grouping How and what do we see? Fundamental Problems ' Focus of attention, or grouping ' What subsets of pixels do we consider as possible objects? ' All connected subsets? ' Representation

More information

Pattern based Residual Coding for H.264 Encoder *

Pattern based Residual Coding for H.264 Encoder * Pattern based Residual Coding for H.264 Encoder * Manoranjan Paul and Manzur Murshed Gippsland School of Information Technology, Monash University, Churchill, Vic-3842, Australia E-mail: {Manoranjan.paul,

More information