Harris-SIFT Descriptor for Video Event Detection based on a Machine Learning Approach


Guillermo Cámara-Chávez
Computer Science Department, Federal University of Ouro Preto
Campus Universitário - Morro do Cruzeiro, Ouro Preto/MG - Brazil
guillermo@iceb.ufop.br

Arnaldo de Albuquerque Araújo
Computer Science Department, Federal University of Minas Gerais
Av. Antônio Carlos 6627, Belo Horizonte/MG - Brazil
arnaldo@dcc.ufmg.br

Abstract

Video data is becoming increasingly important in many commercial and scientific areas with the advent of applications such as digital broadcasting, video-conferencing and multimedia processing tools, and with the development of the hardware and communications infrastructure necessary to support visual applications. The objective of this work is to propose a method for event detection in a video stream. We combine the Harris-SIFT descriptor with motion information in order to detect human actions in video. We tested our method on the KTH database and compared it to the space-time interest points (STIP) descriptor. Our method achieves results comparable to STIP.

1. Introduction

In recent years, digital video has grown rapidly. Nowadays, most digital cameras and many cell phones can record videos, and consumers are recording and uploading their videos online. As a consequence, video sharing sites such as YouTube and Google Video have grown quickly: YouTube had over 6 million videos in August 2006 [6] and over 90 million by December 2008 [5]. Unfortunately, these video search engines often rely on just a filename and text metadata in the form of closed captions. This results in disappointing performance, as quite often the visual content is not mentioned, nor properly described, by the associated text. Due to the limitations of text search, we would like to search based on the automated recognition of the visual events in the video. Many video search engines that analyze visual information only process the key-frames in the video [15, 1].

Events are long-term temporal objects, which usually extend over tens or hundreds of frames. Polana and Nelson [16] classified temporal events into three groups: (i) temporal textures, which are of indefinite spatial and temporal extent (e.g., flowing water); (ii) activities, which are temporally periodic but spatially restricted (e.g., walking, running); and (iii) motion events, which are isolated events that are not periodic, i.e., they do not repeat either in space or in time (e.g., smiling). In this paper, we use the term temporal events to refer to activities. Thus, our objective is to automatically detect events that occur in a video based on local features.

Methods based on local features, or interest points, have shown promising results in action recognition [19, 22]. This approach uses only a few relevant parts of the whole spatiotemporal volume, avoiding less informative regions such as stationary backgrounds. The most straightforward way to detect interesting regions consists in extending 2D interest point detection algorithms (e.g., the methods presented in [18]) to 3D video analysis. Laptev [8] extended the 2D Harris corner detector to a 3D Harris detector. This extended feature detector finds regions with high intensity variations in both the spatial and temporal dimensions. A limitation of this detector is the small number of detections, which is insufficient for most part-based classifiers [22]. Dollar et al.
[4] improved the 3D Harris detector by relaxing its constraints: their detector applies Gabor filtering to the temporal domain only and selects the regions that give high responses. Shechtman and Irani [20] proposed a method that extends the notion of traditional 2D image correlation into a 3D space-time video-template correlation. Many methods explore the temporal dimension in order to obtain better performance. Based on this, we decided to use the Harris-SIFT descriptor [27], a 2D feature, complemented with motion vectors computed through the phase correlation method.

In the literature, most of the existing frameworks in

video event detection are conducted in a two-step procedure. In the first step, representative features are extracted. The second step is the decision-making process. Knowledge-based approaches typically combine the output of different media descriptors into rule-based classifiers. Statistical approaches such as C4.5 decision trees [21] and support vector machines [14] have been used to detect events and improve framework robustness. There are also clustering techniques (k-means [25], the Linde-Buzo-Gray algorithm [11], etc.) used to construct a middle-level visual vocabulary, where each visual word is formed by a group of local descriptors. Each visual word is considered a robust and denoised visual term for describing images. This method is known as bag-of-features (BoF) and has been used for object/event detection [23, 17]. We adopt a clustering method to construct the visual vocabulary; a conventional machine learning algorithm is then applied to classify the word-frequency vectors created from the vocabulary.

2. Machine learning approach

Figure 1. Our proposed model for event detection.

In Fig. 1, we show our proposed model, which consists of two principal steps: training and testing. The training step is divided into three phases: detection and feature extraction of keypoints of interest within motion regions, video codebook generation, and classification of codewords. The testing step is also composed of three phases: detection and feature extraction of keypoints of interest within motion regions, computation of the bag-of-words, and event detection.

In the detection and feature extraction phase, we detect the keypoints of interest that lie within a region with motion. Motion regions of the current frame $f_t$ are detected by a simple difference between the previous frame $f_{t-1}$ and the next frame $f_{t+1}$; pixels with differences greater than a certain threshold are considered part of motion regions. The Scale-Invariant Feature Transform (SIFT) [10] is used for detecting the keypoints. The set of keypoints is refined by selecting those SIFT features with a salient corner in their neighborhood; we use the Harris corner detector to compute the corner response in the original image. This method is known as the Harris-SIFT method [27]. In order to add temporal information, we use the phase correlation method to detect the motion vector of each keypoint: a block of 8×8 pixels around each keypoint in frame $f_{t-1}$ is searched in a corresponding block of pixels in frame $f_{t+1}$. Histograms of velocity vectors are also generated in this phase. Another descriptor we use is the entropy of the phase correlation between the blocks, which measures their similarity.

After calculating all features, we execute the next phase, which consists in clustering the features. The Linde-Buzo-Gray (LBG) clustering algorithm [9] is used to form an appearance codebook; a location codebook is formed by clustering the spatiotemporal locations of the interest points. The interest point features are then vector quantized into codewords according to the codebooks, and each video sample is eventually represented as an occurrence histogram of codewords. Occurrence histograms reflect how many feature points are assigned to each cluster. We use these histograms to train the SVM classifier [2].
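As a concrete illustration of the keypoint detection phase, the sketch below combines frame differencing (threshold of 20, as used in our experiments), SIFT detection restricted to the motion mask, and Harris-based filtering over a 16×16 neighbourhood, using OpenCV. This is a minimal sketch, not the exact implementation; in particular, the corner-response threshold `corner_thresh` is a free parameter not fixed in the text, and grayscale uint8 frames are assumed.

```python
import cv2
import numpy as np

def motion_mask(prev_gray, next_gray, diff_thresh=20):
    """Binary mask of moving pixels: difference between the previous and
    next frames, thresholded, then eroded to drop very small regions."""
    diff = cv2.absdiff(next_gray, prev_gray)
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    return cv2.erode(mask, np.ones((3, 3), np.uint8))

def harris_sift_keypoints(gray, mask, corner_thresh=1e-3, win=16):
    """SIFT keypoints restricted to the motion mask, kept only when a
    salient Harris corner response exists in a win x win neighbourhood.
    corner_thresh is an assumed value, not specified in the paper."""
    sift = cv2.SIFT_create()
    keypoints = sift.detect(gray, mask)
    harris = cv2.cornerHarris(np.float32(gray) / 255.0, 2, 3, 0.04)
    h, w = gray.shape
    kept = []
    for kp in keypoints:
        x, y = int(kp.pt[0]), int(kp.pt[1])
        y0, y1 = max(0, y - win // 2), min(h, y + win // 2)
        x0, x1 = max(0, x - win // 2), min(w, x + win // 2)
        if harris[y0:y1, x0:x1].max() > corner_thresh:
            kept.append(kp)
    return kept
```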
In the testing step, we evaluate the accuracy of our method: given an unknown video sequence, we extract the same features used in training, calculate the occurrence histogram using the centroids found in the clustering process, and finally use the histogram as the input to our SVM classifier.

3. Local descriptors

We use the algorithm presented in [27], which is an improvement of the SIFT descriptor. The Harris-SIFT detector selects those SIFT features with a salient corner in their neighborhood, improving the saliency of the SIFT features and at the same time reducing time complexity.

3.1. SIFT descriptor

This descriptor consists of four major stages: (1) scale-space extrema detection, (2) keypoint localization, (3) orientation assignment and (4) keypoint descriptor [10]. The scale-space extrema detection searches over all scales and image locations. In the keypoint localization stage, at each candidate location, a detailed model is fit to determine location, scale, and contrast.

In the orientation assignment stage, one or more orientations are assigned to each keypoint location based on local image properties, providing invariance to orientation and scale. Finally, the keypoint descriptor stage computes image gradients, measured at the selected scale in the region around each keypoint. However, the SIFT method suffers from three problems [27]: (1) computing the SIFT features is time-consuming; (2) the large number of features further increases detection time; and (3) the SIFT feature set is not salient enough, because it cannot accurately localize corners.

3.2. Harris-SIFT descriptor

This descriptor checks and keeps those SIFT features with a salient corner in their neighborhood, in order to improve the distinctiveness of the feature set. Keypoints found by the SIFT detector that are not near a corner are discarded, speeding up feature detection. Harris-SIFT executes the first stage (scale-space extrema detection) of the SIFT detector; at this point, it is possible to compute the corner response in the original image. SIFT features cannot precisely localize corner points due to scale-space smoothing and pixel error: Mikolajczyk [13] compared different feature detectors and found that the Difference of Gaussians is only an approximation to the Laplacian of Gaussian and introduces pixel errors in feature position. To avoid this problem, we examine the neighborhood of each SIFT feature to find out whether there is a corner next to it. Here, we apply the Harris corner detector [7], defined by formulas (1) and (2):

$$\mu(\mathbf{x}, \sigma_I, \sigma_S) = g(\sigma_I) * \begin{bmatrix} L_x^2(\mathbf{x}, \sigma_S) & L_x L_y(\mathbf{x}, \sigma_S) \\ L_x L_y(\mathbf{x}, \sigma_S) & L_y^2(\mathbf{x}, \sigma_S) \end{bmatrix} \tag{1}$$

$$\mathrm{cornerness} = \det(\mu(\mathbf{x}, \sigma_I, \sigma_S)) - \alpha\, \mathrm{trace}^2(\mu(\mathbf{x}, \sigma_I, \sigma_S)) \tag{2}$$

where $\sigma_I$ is the integration scale, $\sigma_S$ is the scale of the SIFT feature, and $L_\alpha$ is the derivative computed in the $\alpha$ direction. If the maximum corner response within the neighborhood is greater than a threshold, the feature is considered salient, and a SIFT descriptor is then generated.

4. Motion detection

The phase-correlation method (PCM) [26] measures motion directly from the phase correlation map (a shift in the spatial domain is reflected as a phase change in the frequency domain). This method is based on block matching: for each block $r$ in frame $f_{t-1}$, the best match is sought in the neighborhood around the corresponding block in frame $f_{t+1}$. In this work, we chose a block size of 8×8 pixels (the same size as the SIFT block).

Figure 2. Block matching of phase correlation.

In Figure 2, a block of 8×8 pixels in frame $f_{t-1}$ is searched in the corresponding neighborhood of pixels in frame $f_{t+1}$ in order to find the correlation coefficients and the offset. The PCM for one block is defined as:

$$\rho(r_t) = \frac{FT^{-1}\{\hat{r}_{t-1}(\omega)\, \overline{\hat{r}_{t+1}(\omega)}\}}{\sqrt{\int |\hat{r}_{t-1}(\omega)|^2\, d\omega \int |\hat{r}_{t+1}(\omega)|^2\, d\omega}} \tag{3}$$

where $r_t$ is the spatial coordinate vector, $\omega$ is the spatial frequency coordinate vector, $\hat{r}_{t-1}(\omega)$ denotes the Fourier transform of block $r_{t-1}$, $FT^{-1}$ denotes the inverse Fourier transform, $\overline{\{\cdot\}}$ is the complex conjugate, and $\rho(\cdot)$ is the block of correlation coefficients.

We use the entropy $E_r$ of block $r$ as the goodness-of-fit measure for each block; the entropy gives us global information about the block. Thus, for each keypoint detected by Harris-SIFT in frame $f_t$, we use the corresponding blocks in frames $f_{t-1}$ and $f_{t+1}$ to obtain a similarity measure based on entropy and the corresponding histograms of motion vectors, both computed with the phase correlation method.
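A minimal numpy sketch of Eq. (3) and of the entropy measure might look as follows; the peak of the correlation surface gives the block's motion vector. The bin count of the entropy histogram is an assumption, as the text does not specify it.

```python
import numpy as np

def phase_correlation(block_prev, block_next, eps=1e-8):
    """Correlation surface between two blocks (Eq. 3) and the
    displacement of its peak, i.e. the block's motion vector."""
    F1 = np.fft.fft2(block_prev.astype(np.float64))
    F2 = np.fft.fft2(block_next.astype(np.float64))
    cross = F1 * np.conj(F2)                      # cross spectrum
    energy = np.sqrt((np.abs(F1) ** 2).sum() * (np.abs(F2) ** 2).sum())
    rho = np.real(np.fft.ifft2(cross / (energy + eps)))
    dy, dx = np.unravel_index(np.argmax(rho), rho.shape)
    h, w = rho.shape                              # wrap into signed range
    if dy > h // 2: dy -= h
    if dx > w // 2: dx -= w
    return rho, (dx, dy)

def correlation_entropy(rho, bins=32):
    """Shannon entropy of the correlation surface, used as the
    goodness-of-fit (similarity) measure E_r; bins is assumed."""
    hist, _ = np.histogram(rho.ravel(), bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())
```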
5. Codebook generation

After calculating the local features, a clustering process, the bag-of-features (BoF) method [12, 3], is executed. Based on the clusters formed (the set of clusters is referred to as the codebook, and a single cluster as a visual word), occurrence histograms are generated and a classifier is trained on these new features (the histograms). Occurrence histograms reflect how many feature points are assigned to each cluster.

The Linde-Buzo-Gray (LBG) clustering algorithm [9] is commonly used to design a codebook for encoding images in vector quantization. In each iteration of this algorithm, we must search the full codebook in order to assign the training vectors to their corresponding codewords. The steps of the LBG algorithm for the design of an N-vector codebook are straightforward and intuitive. Starting with a large training set (much larger than N), one first selects N initial code vectors, which can be chosen randomly from the training set. There are two basic steps in the algorithm: encoding of the training vectors and computation of the centroids. To begin, we encode all the training vectors using the initial codebook; this process assigns a subset of the training vectors to each cell defined by the initial code vectors. Next, the centroid of each cell is computed, and the centroids are used to form an updated codebook. The process then repeats iteratively, with a re-encoding of the training vectors and a new computation of the centroids to update the codebook. Ideally, at each iteration the average distortion is reduced, until convergence.
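The encode/recompute loop can be sketched as below. This is a simplified version that runs a fixed number of iterations instead of monitoring the average distortion, with random initialization from the training set as described above; the helper shows how one video's descriptors become the occurrence histogram fed to the classifier.

```python
import numpy as np

def lbg_codebook(train_vectors, n_codewords, n_iters=20, seed=0):
    """Encode/recompute loop of LBG codebook design: assign every
    training vector to its nearest codeword, then move each codeword
    to the centroid of its cell, and repeat."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_vectors), n_codewords, replace=False)
    codebook = train_vectors[idx].astype(np.float64)  # random initialization
    for _ in range(n_iters):
        # encoding step: nearest codeword for every training vector
        dists = np.linalg.norm(train_vectors[:, None] - codebook[None], axis=2)
        assign = dists.argmin(axis=1)
        # centroid step: update every non-empty cell
        for k in range(n_codewords):
            cell = train_vectors[assign == k]
            if len(cell) > 0:
                codebook[k] = cell.mean(axis=0)
    return codebook

def occurrence_histogram(descriptors, codebook):
    """Bag-of-features vector for one video: how many of its
    descriptors fall into each codeword's cell."""
    dists = np.linalg.norm(descriptors[:, None] - codebook[None], axis=2)
    return np.bincount(dists.argmin(axis=1), minlength=len(codebook))
```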

Once descriptors have been assigned to clusters to form feature vectors, we reduce the problem of generic visual categorization to that of multi-class supervised learning, with as many classes as defined visual categories. The categorizer performs two separate steps in order to predict the classes of unlabeled images: training and testing. During training, labeled data is sent to the classifier and used to adapt a statistical decision procedure for distinguishing categories. We use the SVM classifier, since it often produces state-of-the-art results in high-dimensional problems.

6. Support vector machine (SVM)

Several learning approaches use the SVM as a classifier. Recently, Schuldt et al. [19] and Wong and Cipolla [22] transformed video event detection into multi-class categorization. Schuldt et al. [19] use an SVM to classify features from a 3D Harris detector. Wong and Cipolla [22] tested different classifiers: quantised feature vectors (a codebook generated with k-means clustering), probabilistic latent semantic analysis (an unsupervised method), SVM, and a nearest-neighbour classifier; among these, the SVM outperformed all the others.

The SVM has been developed as a robust tool for classification and regression in noisy and complex domains such as multimedia retrieval [24]. SVMs can be used to extract valuable information from data sets and to construct fast classification algorithms for massive data. Another important characteristic of the SVM classifier is that, thanks to kernel theory, it allows non-linear classification without explicitly requiring a non-linear algorithm. Common kernel functions are the linear, polynomial, Gaussian radial basis and triangular kernels; each kernel function results in a different type of decision boundary.
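To make the classification stage concrete, the snippet below trains a multi-class SVM on occurrence histograms. sklearn's SVC is used here only as a generic SVM implementation (the paper cites [2] but names no library), the kernel and C values are assumptions, and the random arrays are placeholders for real histograms from the codebook stage.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in data: one occurrence histogram per video clip
# and one of the six KTH action labels per clip.
X_train = np.random.rand(120, 200)        # 120 clips, 200-word codebook
y_train = np.random.randint(0, 6, 120)    # walk/run/jog/box/hclp/hwav

clf = SVC(kernel="rbf", C=10.0)           # Gaussian radial basis kernel
clf.fit(X_train, y_train)

X_test = np.random.rand(30, 200)
predicted = clf.predict(X_test)           # one action label per test clip
```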
7. Results

The human activity data set (KTH) used in the experiments was collected by Schuldt et al. [19]. It contains six human activities: boxing, hand clapping, hand waving, jogging, running and walking (see Figure 3). Twenty-five subjects perform these activities in four different scenarios: indoors, outdoors, with changes in clothing and with variations in scale. Each video sample contains one subject engaged in a single activity in a certain condition.

Figure 3. Examples from the KTH dataset [19].

Even though the KTH data set was recorded over homogeneous backgrounds with a static camera, there is some noise embedded in it: the image sequences appear, in some parts, out of focus (blurred) due to camera movement (zoom, pan), and in one walking video the actor runs during part of the video instead of walking. We compared our method against the STIP descriptor proposed by Laptev [8], using his implementation available at http://…/Laptev/download/stip-1.0-winlinux.zip. The data set was divided into two halves, one used for training and the other for testing.

7.1. Experiments

We tried out different codebook sizes: 50, 100, 200 and 300 code words (clusters). For STIP, we used the descriptor's default parameters. For our proposed method, we first find the mask of pixels that present motion across consecutive frames: the motion of the current frame is computed through the difference between the previous and next frames, and a pixel whose difference is over 20 is considered a motion pixel. We also eliminate very small motion regions using an erosion filter. SIFT keypoints are then detected inside the motion regions. We also compute corners with the Harris detector, and only consider the SIFT keypoints that have a corner response inside a predefined neighbourhood (16×16 pixels around the keypoint).

In Tables 1, 2 and 3, we show the confusion matrices for the Harris-SIFT, Harris-SIFT with motion information (PCM), and STIP descriptors.

Table 1. Harris-SIFT confusion matrix (classes: walk, run, jog, box, hclp, hwav).

Table 2. Harris-SIFT + motion confusion matrix (classes: walk, run, jog, box, hclp, hwav).

Table 3. STIP confusion matrix (classes: walk, run, jog, box, hclp, hwav).

We can see that the Harris-SIFT detector performed slightly worse than Harris-SIFT with motion vectors. Comparing the results of our proposed model with the well-known STIP descriptor, we find that our method is as good as the STIP detector, and in some cases it performed better. For the most part, confusion occurs between jogging and running sequences, as well as between boxing and hand clapping sequences; we observed a similar structure for all methods. The confusion between boxing and hand clapping is due to the small portion of the frame with motion (only hand motion). In our experiments, we used the same motion-detection parameters for all events; adapting the parameters to the percentage of motion would detect more interest points, and this additional information could yield better results.

In Tables 4 and 5, we present the performance of the Harris-SIFT with motion information (PCM) detector and the STIP detector, tested with four codebook sizes. Performance is measured with precision and recall statistics, defined as:

$$\mathrm{recall} = \frac{\mathrm{correct}}{\mathrm{correct} + \mathrm{missed}}, \qquad \mathrm{precision} = \frac{\mathrm{correct}}{\mathrm{correct} + \mathrm{false}}$$

A good detector should have both high precision and high recall. We can see that our detector performed better than STIP when the codebook was small. Since the number of interest points produced by STIP is small, and a small codebook reduces the information even further, STIP's performance was lower than with a bigger codebook. Our method did not show this behavior, since it produces more interest points than STIP; the difference in performance between small and large codebook sizes was small. Thus, our method can work with smaller codebooks.

Table 4. Precision/recall comparison of Harris-SIFT against STIP using 50 clusters in the codebook.

Table 5. Precision/recall comparison of Harris-SIFT against STIP using 100, 200 and 300 clusters in the codebook.

8. Conclusion

This paper considered event detection from a supervised classification perspective. We proposed a method that combines a 2D descriptor (Harris-SIFT) with motion-vector histograms and the correlation between corresponding blocks in the previous and next frames, used to measure similarity at the current frame; we evaluated the entropy of each block to obtain this similarity measure. Our method worked as well with a small number of clusters as with a large one; thus, we can reduce time complexity by computing a small number of visual code words. The next step is to test our method on real-life videos and see how it performs with multiple motions and dynamic backgrounds; the parameters for detecting the portion of motion must then be set according to the type of event.

Acknowledgments

The authors are grateful to CNPq, CAPES and FAPEMIG for the financial support of this work.

References

[1] G. Cámara-Chávez, F. Precioso, M. Cord, S. Phillip-Foliguet, and A. de Albuquerque Araújo. An interactive video content-based retrieval system. In 15th International Conference on Systems, Signals and Image Processing (IWSSIP 2008), Bratislava, June 2008.

[2] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[3] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1-22, 2004.
[4] P. Dollar, V. Rabaud, G. W. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In Joint IEEE Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), 2005.
[5] D. Frommer. YouTube hits 100 million U.S. viewers, Hulu surges. Silicon Alley Insider, Digital Business, December 2008.
[6] L. Gomes. Will all of us get our 15 minutes on a YouTube video? The Wall Street Journal, August 2006.
[7] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, pages 147-151, 1988.
[8] I. Laptev. On space-time interest points. International Journal of Computer Vision (IJCV), 63(2-3):107-123, 2005.
[9] Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, 28:84-94, 1980.
[10] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[11] Z. Lu, H. Lou, and J. Pan. 3D model retrieval based on vector quantisation index histograms. Journal of Physics: Conference Series, 48, 2006.
[12] M. Marszałek and C. Schmid. Spatial weighting for bag-of-features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, 2006.
[13] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In International Conference on Computer Vision, 2001.
[14] M. Chen, S.-C. Chen, M.-L. Shyu, and K. Wickramaratna. Semantic event detection via multimodal data mining. IEEE Signal Processing Magazine, 23(2):38-46, March 2006.
[15] M. Pickering, S. Rüger, and D. Sinclair. Video retrieval by feature learning in key frames. In International Conference on Image and Video Retrieval, 2002.
[16] R. Polana and R. Nelson. Recognition of motion from temporal texture. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 1992), Champaign, IL, USA, June 1992.
[17] P. Quelhas, F. Monay, J. Odobez, D. Gatica-Perez, and T. Tuytelaars. A thousand words in a scene. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9), 2007.
[18] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. International Journal of Computer Vision (IJCV), 37(2):151-172, 2000.
[19] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In 17th International Conference on Pattern Recognition (ICPR 2004), volume 3, pages 32-36, August 2004.
[20] E. Shechtman and M. Irani. Space-time behavior-based correlation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11), 2007.
[21] S.-C. Chen, M.-L. Shyu, M. Chen, and C. Zhang. A decision tree-based multimodal data mining framework for soccer goal detection. In IEEE International Conference on Multimedia and Expo (ICME 2004), volume 1, June 2004.
[22] S.-F. Wong and R. Cipolla. Extracting spatiotemporal interest points using global information. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1-8, Rio de Janeiro, October 2007.
[23] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In 9th IEEE International Conference on Computer Vision (ICCV 2003), volume 2, pages 1470-1477, 2003.
[24] S. Tong. Active Learning: Theory and Applications. PhD thesis, Stanford University, 2001.
[25] F. Wang, Y.-G. Jiang, and C.-W. Ngo. Video event detection using motion relativity and visual relatedness. In MM '08: Proceedings of the 16th ACM International Conference on Multimedia, New York, NY, USA, 2008. ACM.
[26] J. Z. Wang. Methodological review - wavelets and imaging informatics: a review of the literature. Journal of Biomedical Informatics, July 2001.
[27] N. Xu and W.-D. Chen. A high real-time and robust object recognition and localization algorithm. China Journal of Image and Graphics, submitted October 2007.
