Video Event Detection via Multi-modality Deep Learning


22nd International Conference on Pattern Recognition, 2014

Video Event Detection via Multi-modality Deep Learning

I-Hong Jhuo 1 and D.T. Lee 1,2
1 Institute of Information Science, Academia Sinica, Taipei, Taiwan
2 Dept. of Computer Science and Engineering, National Chung Hsing University, Taichung, Taiwan
ihjhuo@gmail.com, dtlee@ieee.org

Abstract

Detecting complex video events based on audio and visual modalities is still a largely unresolved issue. Whereas conventional video representation methods extract each modality ineffectively, we propose a regularized multi-modality deep learning method for video event detection. We first build an auto-encoder based on unconstrained minimization and adopt the conjugate gradient method with line search for optimization. The learned auto-encoder captures the relationship between the audio and visual modalities corresponding to the same video event at each layer of the network. To make the network robust to local variance, we apply the commonly used local contrast normalization and spatial maximum pooling to each modality for video representation. Compared with traditional methods using manually designed features, our method is more efficient. Experimental results on publicly available video event detection datasets demonstrate that the proposed method consistently outperforms state-of-the-art video representation methods.

1. Introduction

Automatic detection of complex multimedia events in diverse Internet videos is a fundamental computer vision and multimedia problem, and it is imperative for real-life applications such as video retrieval and surveillance. A large portion of Internet videos are captured by amateur consumers under uncontrolled conditions and are uploaded without any professional post-processing. As a result, video event recognition becomes extremely challenging. Most previous approaches [5, 12] are considerably time consuming because they rely on manually designed features for video representation. Two such examples are the space-time local volume bag-of-words (BoW) based video representation [12] and the visual salient-point BoW representation [16]. Although the features of these two representations have the potential to increase the accuracy of video event recognition, using them effectively requires strong, domain-specific knowledge. Moreover, because of the computational cost, extracting such features for video content analysis is restrictive, especially for large-scale data analysis.

Recently, fusing multiple modalities such as audio and visual information has provided more promising results for video event detection than visual information alone. Research on how to effectively and efficiently combine multiple modalities for a more robust video representation has therefore grown in popularity. Recent pioneering research [4, 8, 9, 24] incorporating both audio and visual modalities for video content analysis has demonstrated its effectiveness and significantly improved performance. Existing deep neural network methods yield excellent results on computer vision tasks [22, 23] of image/audio representation and have many characteristics desirable for large-scale image/audio classification. Rather than manually designing features, deep neural networks provide another way to automatically extract features from raw pixels.
For example, by layer-wise stacking of basic building blocks such as Convolutional Neural Networks (CNN) [14] and Reconstruction Independent Component Analysis (RICA) [13], deep neural networks gradually extract more semantically meaningful features in higher layers. It is worth noting that on the extremely challenging ImageNet classification task [14, 13], deep neural network based methods outperform traditional methods based on manually designed features by a significant margin. Since a deep neural network is a feedforward network at test time for image/audio representation, it not only improves classification results but also reduces the computational cost, and it has the characteristics required for large-scale image/audio classification [14, 15, 23].

In this paper, we propose a single-layer neural network for video representation and event detection.

Figure 1: Framework of our proposed method for video representation. The framework is based on multiple modalities and a deep neural network. First, we simultaneously extract local patches from audio signals and visual keyframes of the same video, and then randomly select patches to learn the auto-encoders, i.e., weights W_a^i and W_v^i, in an unsupervised manner. Once the auto-encoders are learned, spatial maximum pooling and local contrast normalization (LCN), i.e., local subtractive and divisive normalization [17], are applied to each transformed audio/keyframe input, where the input data are represented by the learned weights. The final step of the framework shows the architecture used for each audio and keyframe input in our network. This figure is best viewed in color.

In particular, inspired by the good performance of RICA [13] on ImageNet and motivated by deep neural networks, we propose a regularized multi-modality deep learning algorithm and employ it as the basic building block of a deep neural network. The proposed method simultaneously encodes the relationships between the visual and audio modalities. The resulting video representation outperforms manually designed feature-based video representations in both the accuracy and the efficiency of video event detection.

2. Related Work

In recent years, there have been several interesting works on video content analysis based on fusing multiple modalities. For example, the early fusion strategy [9] averages multiple kernel matrices based on audio and visual modalities before classification. In contrast to early fusion, Natarajan et al. [18] utilized a late fusion strategy that averages the prediction scores of multiple independently trained classifiers for video event detection. Moreover, a joint probability model of audio and visual modalities was developed by Beal et al. [2] for object detection in videos. Jiang et al. [10] proposed Audio-Visual Grouplets (AVGL) by exploring temporal interactions between audio and visual features. Each AVGL is defined as a set of audio and visual codewords, which are grouped together across the audio and visual modalities for video concept classification. Cristani et al. [4] synchronized foreground objects and audio sounds for object detection in videos. Numerous further papers utilize audio and visual information for video content analysis [9, 25]. The methods mentioned above focus on traditional, manually designed features for visual and audio information; handling such features is time consuming, and there is no straightforward way to fuse them.

On the other hand, building blocks of deep neural networks have received increasing attention in computer vision, especially for image representation. These building blocks can be roughly categorized into either the global image representation based building block (GIR2B) strategy or the local image patch based building block (LIP2B) strategy. Complete image-level training, i.e., GIR2B, requires plenty of training samples to train the networks; this strategy is not applicable when the number of training samples is greatly reduced or restricted. GIR2B-based works include the Auto-Encoder (AE), the Restricted Boltzmann Machine (RBM) [6], and extensions of the RBM and AE such as the Stacked Denoising Auto-Encoder [23], the Deep Boltzmann Machine (DBM), and the Contractive Auto-encoder [21].
In contrast to GIR2B, recent studies employing the LIP2B strategy include Reconstruction Independent Component Analysis (RICA) [13], Deconvolutional Networks (DN) [26], and Convolutional Neural Networks (CNN) [15]. These building blocks generally operate at the image patch level; as such, there are sufficient patches to stably train a network. Compared with the GIR2B strategy, the LIP2B strategy is more flexible when dealing with intra-class variance. As a result, the LIP2B strategy usually achieves significant performance improvements on very challenging image classification datasets such as ImageNet, Caltech 101, and Caltech 256.

In addition to the single-modality deep neural networks mentioned above, both Ngiam et al. [19] and Srivastava et al. [22] have recently adopted multi-modality deep neural networks for signal processing tasks. Both of these architectures use the global image representation, which is considerably restrictive for real-life applications. In particular, the two architectures force the hidden states of the multiple modalities to be identical, which neglects the diverse qualities of the different modalities. Motivated by these deep learning works, we develop a multi-modality deep learning algorithm that encodes the relationships between visual and audio information for video event detection.

3. Multi-modality Deep Learning Framework

In this section, we describe our proposed regularized multi-modality deep learning algorithm. The method includes the following modules:

(1) preprocessing the input data with whitening; (2) learning the audio and visual auto-encoders in an unsupervised manner; (3) spatial maximum pooling and local contrast normalization (LCN) [11]; and finally (4) concatenating the learned audio and visual representations into a single video representation. We explain each of these modules in the following subsections.

We first define the variables of our neural network. We use {x_f^i}_{i=1}^n ⊂ R^p to denote unlabeled video keyframe patches and unlabeled audio signal patches. The superscript i indexes the patch, and the subscript f indicates whether the patch is a visual keyframe patch or an audio signal patch: x_v^i denotes a visual keyframe patch and x_a^i an audio signal patch. A visual/audio patch pair with the same superscript i, i.e., x_v^i and x_a^i, is collected from the same video. In our neural network, features are learned from raw pixels. All patches are gathered into the matrix X = [x_v^1, ..., x_v^n, x_a^1, ..., x_a^n], where n denotes the number of visual keyframe/audio signal patches.

3.1. Input Data Preprocessing

Motivated by the success of studies in the deep neural network community, a whitening step is commonly used to de-correlate the input data [14]. We also adopt this preprocessing before learning the auto-encoders from videos. In particular, we normalize each feature patch x_v^i and x_a^i by subtracting the mean of all its entries and then dividing by the standard deviation of all its entries. In our experiments, we found this whitening preprocessing essential for good performance of our neural network based video event detection.

3.2. Unsupervised Feature Learning with Multiple Modalities

We first introduce the Independent Component Analysis (ICA) [7] algorithm for a set of input visual data X_v corresponding to the features of all patches {x_v^i}_{i=1}^n. The goal of the ICA algorithm [7] is to learn the auto-encoders in an unsupervised manner, with the objective function

\min_{W_v} \sum_{i=1}^{n} \sum_{j=1}^{m} \phi(W_v^j x_v^i), \quad \text{s.t. } W_v W_v^\top = I,

where W_v \in \mathbb{R}^{m \times p} is the learned weight matrix (p is the patch dimension and m the number of filters), W_v^j is the jth row of W_v, and \phi is a nonlinear convex function such as the L1 penalty \phi(\cdot) = \log(\cosh(\cdot)) described in [13]. However, this method has difficulty learning overcomplete filters because of the orthogonality constraint W_v W_v^\top = I. In [13], the hard orthogonality constraint of ICA is relaxed to a soft reconstruction cost, which leads to the RICA objective function

\min_{W_v} \frac{1}{n} \sum_{i=1}^{n} \| W_v^\top W_v x_v^i - x_v^i \|_2^2 + \alpha \sum_{i=1}^{n} \sum_{j=1}^{m} \phi(W_v^j x_v^i), \qquad (1)

where W_v serves as both the encoder and the decoder. Given the smooth reconstruction penalty in Eq. (1), which corresponds to the reconstruction cost of a linear auto-encoder [13], the unconstrained problem can handle overcompleteness and the cost function can be minimized efficiently. By taking advantage of additional image information in the classification problem, recent research [3] has used RGB-D images (color images (RGB) with an additional depth channel (D)) to learn three-dimensional features for object recognition. Inspired by the success of these studies, we propose a regularized multi-modality deep learning algorithm to learn visual and audio auto-encoders (also called filters) from unlabeled video event data. This learning reconstruction problem can be formulated with the following objective function:

\min_{W_a, W_v} \sum_{i=1}^{n} \Big( \| W_a^\top W_a x_a^i - x_a^i \|_2^2 + \| W_v^\top W_v x_v^i - x_v^i \|_2^2 + \alpha \big( \| W_a x_a^i \|_2^2 + \| W_v x_v^i \|_2^2 \big) + \beta \| W_v x_v^i - W_a x_a^i \|_2^2 \Big), \qquad (2)

where \alpha and \beta are trade-off parameters, x_a^i is an audio feature patch, and x_v^i is a visual feature patch; the final term couples the audio and visual codes of the same video. To learn the audio and visual encoders W_a and W_v, we adopt an off-the-shelf conjugate gradient algorithm. Similar to RICA, our formulation can be optimized efficiently.
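Eq. (2) is an unconstrained, smooth cost, so it can in principle be minimized with a generic conjugate gradient routine as described above. The following is only a minimal NumPy/SciPy sketch of that idea under the reconstruction of Eq. (2) given here; the synthetic data, the patch and filter sizes, the parameter values, and the use of scipy.optimize.minimize with a numerically approximated gradient are illustrative assumptions rather than the authors' MATLAB implementation.

```python
# Minimal sketch of the Eq. (2) objective (assumed reading: two RICA-style reconstruction
# terms, an activation penalty weighted by alpha, and a cross-modality coupling term
# weighted by beta). All sizes and data below are synthetic and illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
p_a = p_v = 36            # audio / visual patch dimensions (illustrative)
m_a = m_v = 16            # numbers of filters; kept equal so the two codes can be compared
n = 200                   # number of paired patches
alpha, beta = 0.1, 0.1    # trade-off parameters (chosen by validation in the paper)

X_a = rng.standard_normal((p_a, n))
X_v = rng.standard_normal((p_v, n))
# Whitening-style preprocessing used in the paper: per-patch mean/std normalization.
X_a = (X_a - X_a.mean(0)) / (X_a.std(0) + 1e-8)
X_v = (X_v - X_v.mean(0)) / (X_v.std(0) + 1e-8)

def unpack(theta):
    W_a = theta[:m_a * p_a].reshape(m_a, p_a)
    W_v = theta[m_a * p_a:].reshape(m_v, p_v)
    return W_a, W_v

def cost(theta):
    W_a, W_v = unpack(theta)
    H_a, H_v = W_a @ X_a, W_v @ X_v                  # modality codes
    rec_a = W_a.T @ H_a - X_a                        # ||W_a^T W_a x_a - x_a||^2 residual
    rec_v = W_v.T @ H_v - X_v                        # ||W_v^T W_v x_v - x_v||^2 residual
    act = alpha * (np.sum(H_a ** 2) + np.sum(H_v ** 2))
    couple = beta * np.sum((H_v - H_a) ** 2)         # cross-modality correlation term
    return np.sum(rec_a ** 2) + np.sum(rec_v ** 2) + act + couple

theta0 = 0.1 * rng.standard_normal(m_a * p_a + m_v * p_v)
# Conjugate gradient, as in the paper; the gradient here is approximated numerically,
# so the problem is kept tiny -- an analytic gradient would be used in practice.
res = minimize(cost, theta0, method="CG", options={"maxiter": 3})
W_a, W_v = unpack(res.x)
print("final cost:", res.fun)
```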
3.3. Spatial Pooling and Normalization

After the weights W_v and W_a are obtained via our proposed deep learning algorithm, we separately map the visual and audio patches of a video to obtain new visual and audio feature representations. For each new visual/audio representation, we sequentially apply spatial maximum pooling and local contrast normalization (LCN) [11] to preserve the local invariance property. It is worth noting that the local subtractive normalization step of LCN removes the weighted average of the neighboring pixels from the current pixel. For simplicity, we randomly sample visual and audio patches to learn W_v and W_a in our neural network. Once the auto-encoders are learned, we apply them to all visual and audio input data for video representation. In this paper, we use this basic building block (auto-encoder learning, pooling, and LCN) and concatenate the visual and audio features of each video into its final representation.
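For concreteness, the sketch below illustrates these two post-processing sublayers on a single 2-D filter-response map: non-overlapping spatial maximum pooling followed by local subtractive and divisive normalization, in the spirit of [11, 17]. The map layout, pooling window, and Gaussian neighborhood size are assumptions for illustration; the paper does not fix them in this section.

```python
# Sketch of spatial maximum pooling and local contrast normalization on a 2-D response map.
import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_max_pool(fmap, pool=2):
    """Non-overlapping max pooling over pool x pool blocks of a 2-D feature map."""
    h, w = fmap.shape
    h, w = h - h % pool, w - w % pool                     # crop to a multiple of the pool size
    blocks = fmap[:h, :w].reshape(h // pool, pool, w // pool, pool)
    return blocks.max(axis=(1, 3))

def local_contrast_normalize(fmap, sigma=2.0, eps=1e-8):
    """Subtract a Gaussian-weighted local mean, then divide by the local standard deviation."""
    local_mean = gaussian_filter(fmap, sigma)
    centered = fmap - local_mean                          # local subtractive normalization
    local_std = np.sqrt(gaussian_filter(centered ** 2, sigma))
    return centered / np.maximum(local_std, eps)          # local divisive normalization

# Toy usage: one filter-response map of a keyframe.
rng = np.random.default_rng(0)
response = rng.standard_normal((64, 64))
pooled = spatial_max_pool(response, pool=4)
normalized = local_contrast_normalize(pooled)
print(pooled.shape, normalized.shape)
```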

In summary, the auto-encoder learning algorithm, spatial maximum pooling, and LCN constitute the three sublayers of our architecture. Figure 1 illustrates an overview of our framework and the mapping architecture.

4. Experiments

In this section, we test our proposed method on two publicly available video event detection datasets, the TRECVID MED 2011 Development set [18] (MED 2011) and the Columbia Consumer Video (CCV) dataset [8]. We compare our regularized multi-modality deep learning algorithm with the following methods, which achieve state-of-the-art results for video event detection.

(1) Single Visual Feature (SVF). We follow the standard experimental setting and report only the best performance among the visual features mentioned in [24].
(2) Early Fusion (EF). Each video is represented by a high-dimensional vector formed by concatenating SIFT, STIP [12], and MFCC [20] features; the final dimension is 14,000.
(3) Late Fusion (LF). We train one classifier per feature and average the output scores of the classifiers for event detection.
(4) Reconstruction Independent Component Analysis (RICA) [13]. We follow the standard experimental setting [13] and combine the visual and audio modalities as input for unsupervised learning.
(5) Our proposed regularized multi-modality auto-encoder (RMAE) method. In all experiments, we learn the auto-encoders simultaneously from the audio and visual modalities. We use a one-layer neural network on all datasets and report the performance based on the concatenated feature representation.
(6) The state-of-the-art video event detection method in [24], Bi-Modal Bag-of-Words with Maximum Pooling (BMBoW-MP); the size of the bi-modal codebook is set to 4,000 in our experiments.

Baseline Feature Extraction. We follow the bag-of-words feature representation setting in [8]. For visual features, we adopt the sparse keypoint detector Difference of Gaussians [16] to find local keypoints in each keyframe. Each keypoint is described by a 128-dimensional SIFT vector, and the features are quantized into a 5,000-dimensional bag-of-words (BoW) histogram. In addition, we extract MFCC audio features [20] from the video content and quantize them into a 4,000-dimensional BoW histogram. Specifically, we sample one frame per second of video for the visual and audio features.

Experimental Setting. For the visual modality, we first extract one keyframe per second from each video, resize each keyframe to a fixed resolution, and then extract keyframe patches with an overlap step of one pixel, where the step indicates the distance between two neighboring patches. For the audio modality, we extract each audio signal patch from each video with an overlap step of one second during MFCC feature extraction. Overall, we randomly select a total of 800,000 visual and audio patches for unsupervised learning and set the dimensions of W_a and W_v to 100 and 200, respectively. Since the number of keyframes differs between videos, we average all keyframe vectors to obtain a single video representation. In addition, to de-correlate the input data [13], we preprocess the input data before unsupervised learning: each high-dimensional data vector is normalized by subtracting its mean and dividing by its standard deviation.
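As a rough illustration of the patch preparation just described, the sketch below densely extracts overlapping patches from one keyframe with a one-pixel step, normalizes each patch by its own mean and standard deviation, and randomly samples a subset for unsupervised learning. The keyframe resolution, patch size, and sample count are placeholders, since the exact values are not recoverable here.

```python
# Sketch of dense overlapping patch extraction, per-patch normalization, and random sampling.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(keyframe, patch=8, step=1):
    """Return all patch x patch windows of a grayscale keyframe, flattened as columns."""
    windows = sliding_window_view(keyframe, (patch, patch))[::step, ::step]
    return windows.reshape(-1, patch * patch).T           # shape: (patch_dim, num_patches)

def normalize_patches(P, eps=1e-8):
    """Per-patch normalization: subtract the patch mean and divide by its standard deviation."""
    return (P - P.mean(axis=0)) / (P.std(axis=0) + eps)

rng = np.random.default_rng(0)
keyframe = rng.random((96, 128))                           # stand-in for a resized keyframe
P = normalize_patches(extract_patches(keyframe))
sample = P[:, rng.choice(P.shape[1], size=5000, replace=False)]
print(P.shape, sample.shape)
```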
For video event detection, we use a one-vs-all SVM as the classifier and Average Precision (AP) as the evaluation metric. We compute AP for each event and use the Mean Average Precision (MAP) over all events of a dataset as the final measurement. To determine the parameter setting of our method, we vary the values of α and β during unsupervised learning and choose the values that give the best validation performance. For the SVM classifier, we vary the parameter C and choose the best value by five-fold cross validation. We use the χ² kernel for the SVM, computed as k(i, j) = e^{-d_{χ²}(i, j)/η}, where η is set by default to the mean of all pairwise distances in the training set.

Our framework is implemented on the MATLAB platform on a sixteen-core Intel Xeon X4860 processor with a 2.26 GHz CPU and 64 GB of memory, and the unsupervised learning process finishes quickly. For example, on the Columbia Consumer Video dataset, one iteration of computing the auto-encoders takes less than 3 seconds, and the total time needed to learn the auto-encoders is within 3 hours.
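The classification stage described above can be sketched as follows: pairwise χ² distances between the (non-negative) video representations, the kernel k(i, j) = exp(-d_{χ²}(i, j)/η) with η set to the mean pairwise distance, and a one-vs-all SVM trained on the precomputed kernel. The feature dimensionality, labels, and C value below are illustrative, and scikit-learn is used purely for convenience rather than being the authors' setup.

```python
# Sketch of the chi-squared kernel and a one-vs-all SVM on the precomputed kernel matrix.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

def chi2_distances(A, B, eps=1e-10):
    """Pairwise chi-squared distances between rows of A and rows of B (non-negative features)."""
    diff = A[:, None, :] - B[None, :, :]
    summ = A[:, None, :] + B[None, :, :] + eps
    return 0.5 * np.sum(diff ** 2 / summ, axis=2)

rng = np.random.default_rng(0)
X_train = rng.random((120, 50))          # stand-in video representations (e.g., pooled features)
y_train = rng.integers(0, 3, size=120)   # stand-in event labels
X_test = rng.random((30, 50))

D_train = chi2_distances(X_train, X_train)
eta = D_train.mean()                                      # default eta: mean pairwise distance
K_train = np.exp(-D_train / eta)
K_test = np.exp(-chi2_distances(X_test, X_train) / eta)

# One-vs-all SVM on the precomputed chi-squared kernel; C would be chosen by cross validation.
clf = OneVsRestClassifier(SVC(kernel="precomputed", C=1.0))
clf.fit(K_train, y_train)
print(clf.predict(K_test)[:10])
```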

4.1. Experiment on TRECVID MED 2011 Development Dataset

TRECVID Multimedia Event Detection (MED) is a challenging task of detecting complicated high-level events. We first evaluate our proposed method on the TRECVID MED 2011 development dataset. The dataset includes five events (Attempting a board trick, Feeding an animal, Landing a fish, Wedding ceremony, and Working on a woodworking project) and one background class. In total it consists of 10,804 videos spanning 17,566 minutes of web video, partitioned into a training set (8,783 videos) and a test set (2,021 videos). It is worth noting that the training set contains only about 100 positive videos per event.

Figure 2: AP performance comparison of different methods on the TRECVID MED 2011 dataset. This figure is best viewed in color.

The per-event performance of all the compared methods is shown in Figure 2. From the experimental results, we make the following observations. (1) Our proposed RMAE video representation produces better results than all the other baseline methods in terms of MAP, with significant performance improvements on the five events. (2) The RICA and RMAE video representations outperform the single visual feature (SVF), early fusion (EF), and late fusion (LF) methods by a large margin. This is because the neural network based methods encode the useful audio and visual information jointly, while SVF considers only visual information, and both EF and LF simply fuse the audio and visual features/scores in a superficial way without exploring their dependence. (3) All the multi-modality methods, including RMAE, RICA, and the state-of-the-art method, obtain better results than those based on a single feature, which confirms the benefit of multi-modality modeling for video event detection. (4) Comparing RMAE with RICA, our proposed method shows a significant improvement because RMAE captures the relationships between the audio and visual modalities of videos from the same category. For example, the performance gap between RICA and RMAE on Attempting a board trick is larger than on Wedding ceremony. This is because the Attempting a board trick event usually involves simple hitting sounds when the visual content shows the board trick actions, whereas the Wedding ceremony event often involves more complex sounds, such as applause and various kinds of background music, which make event detection more difficult. In addition, our proposed method achieves better results than the BMBoW-MP method. Although the bi-modal bag-of-words concurrently captures both audio and visual information, some isolated audio and visual words remain after the bipartite partitioning algorithm, so the bi-modal BoW representation may incur information loss. In contrast, our proposed method uses the learned audio and visual auto-encoders to represent each video with minimal information loss.

Figure 3: Performance comparison of each single modality on the two benchmarks.

4.2. Experiment on Columbia Consumer Video (CCV) Dataset

Our second experiment uses the Columbia Consumer Video (CCV) benchmark dataset [8]. This dataset includes 9,317 YouTube videos annotated over 20 semantic categories, with 4,659 videos designated for training and the remaining 4,658 videos for testing. The per-category performance of all the compared methods is shown in Figure 4, where we follow the setting in [24] and set the size of the bi-modal codebook to 6,000. From the results, we observe the following. (1) The proposed RMAE method achieves the best performance in terms of MAP.
In particular, it outperforms SVF, LF, and BMBoW-MP by 9.89%, 6.52%, and 1.21%, respectively, which clearly demonstrates that our method is superior to all the baseline feature representations. (2) The RMAE method achieves the best performance on most of the event categories; for instance, on the bird event, our method outperforms the best baseline method, SVF, by 7.61%. (3) Compared with the LF baseline across the object and event concept categories, our method achieves a larger performance gain on the music performance category than on the dog category. The reason may be that the music performance category often contains characteristic background audio, whereas for the dog category the LF method cannot successfully capture the relationships between the modalities; the audio in the dog category includes not only barking but also environmental noise, which is hard to distinguish for classification. (4) In most event/scene categories, i.e., excluding the biking, bird, and parade categories, the proposed method achieves better results than the RICA method, which demonstrates that the final term in Eq. (2) exploits the correlations between the audio and visual modalities.

Figure 4: Per-category performance comparison on the CCV dataset.

In general, we expect a high performance impact from the proposed RMAE method because it automatically captures the relationships between multiple modalities. Figure 3 compares the performance of each single modality on the two datasets. As can be seen, the proposed RMAE method performs better than either individual modality alone. This confirms the conclusion of [24] that including the audio modality can significantly improve video event detection performance. Moreover, the audio modality suffers less information loss on the CCV dataset than on the TRECVID MED 2011 dataset, since distinguishable audio patterns can be extracted for certain categories such as the music performance event. Comparing SVF and RMAE, the proposed method attains better results than the SIFT BoW feature extraction, which makes it a more robust visual representation for video event detection.

5. Conclusion

In this work, we have introduced a regularized multi-modality deep learning algorithm to improve conventional feature extraction for video event detection. The proposed method not only learns the audio and visual auto-encoders in an unsupervised manner but also encodes the relationships between the audio and visual modalities of the same video with minimal information loss. In addition, with the help of spatial maximum pooling and local contrast normalization, the features learned by our neural network are robust to local variance, i.e., adopting LCN and pooling improves video event detection. Extensive experiments on two public video event benchmarks consistently show that the proposed method significantly outperforms manually designed features and fusion methods, and that it achieves the best performance compared with the RICA-based deep neural network. These promising results show the effectiveness of our proposed multi-modality deep learning for video event detection.

References

[1]
[2] M. Beal, N. Jojic, and H. Attias. A Graphical Model for Audiovisual Object Tracking. IEEE TPAMI.
[3] M. Blum, J. Springenberg, J. Wülfing, and M. Riedmiller. A Learned Feature Descriptor for Object Recognition in RGB-D Data. In ICRA '12.
[4] M. Cristani, M. Bicego, and V. Murino. Audio-visual Event Recognition in Surveillance Video Sequences. IEEE TMM.
[5] A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik. Recognizing Objects in Range Data Using Regional Point Descriptors. In ECCV '04.
[6] G. E. Hinton, S. Osindero, and Y.-W. Teh. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18(7).
[7] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley Interscience.
[8] Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. Loui. Consumer Video Understanding: A Benchmark Database and an Evaluation of Human and Machine Performance. In ICMR.
[9] Y.-G. Jiang, S. Bhattacharya, S.-F. Chang, and M. Shah. High-Level Event Recognition in Unconstrained Videos. IJMIR.
[10] W. Jiang and A. Loui. Audio-visual Grouplet: Temporal Audio-visual Interactions for General Video Concept Classification. In ACM MM.
[11] K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun. What is the Best Multi-stage Architecture for Object Recognition? In ICCV '09.
[12] I. Laptev and T. Lindeberg. On Space-time Interest Points. IJCV.
[13] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. In NIPS '11.
[14] Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, and A. Y. Ng. Tiled Convolutional Neural Networks. In NIPS '10.
[15] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4).
[16] D. Lowe. Distinctive Image Features from Scale-invariant Keypoints. IJCV.
[17] S. Lyu and E. Simoncelli. Nonlinear Image Representation Using Divisive Normalization. In ICCV '09.
[18] P. Natarajan, V. Manohar, S. Wu, S. Tsakalidis, S. N. Vitaladevuni, X. Zhuang, R. Prasad, G. Ye, D. Liu, I.-H. Jhuo, S.-F. Chang, H. Izadinia, I. Saleemi, and M. Shah. BBN VISER TRECVID 2011 Multimedia Event Detection System. In NIST TRECVID Workshop.
[19] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal Deep Learning. In ICML '11.
[20] L. Pols. Spectral Analysis and Identification of Dutch Vowels in Monosyllabic Words. Doctoral dissertation.
[21] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive Auto-encoders: Explicit Invariance during Feature Extraction. In ICML '11.
[22] N. Srivastava and R. Salakhutdinov. Multimodal Learning with Deep Boltzmann Machines. In NIPS '12.
[23] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. JMLR, 11(5), 2010.
[24] G. Ye, I.-H. Jhuo, D. Liu, Y.-G. Jiang, D. T. Lee, and S.-F. Chang. Joint Audio-Visual Bi-Modal Codewords for Video Event Detection. In ICMR '12.
[25] J.-C. Wang, Y.-H. Yang, I.-H. Jhuo, Y.-Y. Lin, and H.-M. Wang. The Acoustic-Visual Emotion Gaussians Model for Automatic Generation of Music Video. In ACM MM.
[26] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus. Deconvolutional Networks. In CVPR.


More information

Multi-Glance Attention Models For Image Classification

Multi-Glance Attention Models For Image Classification Multi-Glance Attention Models For Image Classification Chinmay Duvedi Stanford University Stanford, CA cduvedi@stanford.edu Pararth Shah Stanford University Stanford, CA pararth@stanford.edu Abstract We

More information

ImageCLEF 2011

ImageCLEF 2011 SZTAKI @ ImageCLEF 2011 Bálint Daróczy joint work with András Benczúr, Róbert Pethes Data Mining and Web Search Group Computer and Automation Research Institute Hungarian Academy of Sciences Training/test

More information

Depth Image Dimension Reduction Using Deep Belief Networks

Depth Image Dimension Reduction Using Deep Belief Networks Depth Image Dimension Reduction Using Deep Belief Networks Isma Hadji* and Akshay Jain** Department of Electrical and Computer Engineering University of Missouri 19 Eng. Building West, Columbia, MO, 65211

More information

Content-Based Image Retrieval Using Deep Belief Networks

Content-Based Image Retrieval Using Deep Belief Networks Content-Based Image Retrieval Using Deep Belief Networks By Jason Kroge Submitted to the graduate degree program in the Department of Electrical Engineering and Computer Science of the University of Kansas

More information

Supplementary Material for Ensemble Diffusion for Retrieval

Supplementary Material for Ensemble Diffusion for Retrieval Supplementary Material for Ensemble Diffusion for Retrieval Song Bai 1, Zhichao Zhou 1, Jingdong Wang, Xiang Bai 1, Longin Jan Latecki 3, Qi Tian 4 1 Huazhong University of Science and Technology, Microsoft

More information

Latest development in image feature representation and extraction

Latest development in image feature representation and extraction International Journal of Advanced Research and Development ISSN: 2455-4030, Impact Factor: RJIF 5.24 www.advancedjournal.com Volume 2; Issue 1; January 2017; Page No. 05-09 Latest development in image

More information

Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search

Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search Lu Jiang 1, Deyu Meng 2, Teruko Mitamura 1, Alexander G. Hauptmann 1 1 School of Computer Science, Carnegie Mellon University

More information

KBSVM: KMeans-based SVM for Business Intelligence

KBSVM: KMeans-based SVM for Business Intelligence Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2004 Proceedings Americas Conference on Information Systems (AMCIS) December 2004 KBSVM: KMeans-based SVM for Business Intelligence

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

arxiv: v1 [cs.cv] 4 Feb 2018

arxiv: v1 [cs.cv] 4 Feb 2018 End2You The Imperial Toolkit for Multimodal Profiling by End-to-End Learning arxiv:1802.01115v1 [cs.cv] 4 Feb 2018 Panagiotis Tzirakis Stefanos Zafeiriou Björn W. Schuller Department of Computing Imperial

More information

The SIFT (Scale Invariant Feature

The SIFT (Scale Invariant Feature The SIFT (Scale Invariant Feature Transform) Detector and Descriptor developed by David Lowe University of British Columbia Initial paper ICCV 1999 Newer journal paper IJCV 2004 Review: Matt Brown s Canonical

More information

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech Convolutional Neural Networks Computer Vision Jia-Bin Huang, Virginia Tech Today s class Overview Convolutional Neural Network (CNN) Training CNN Understanding and Visualizing CNN Image Categorization:

More information

III. VERVIEW OF THE METHODS

III. VERVIEW OF THE METHODS An Analytical Study of SIFT and SURF in Image Registration Vivek Kumar Gupta, Kanchan Cecil Department of Electronics & Telecommunication, Jabalpur engineering college, Jabalpur, India comparing the distance

More information