Video Event Detection via Multi-modality Deep Learning


22nd International Conference on Pattern Recognition, 2014

Video Event Detection via Multi-modality Deep Learning

I-Hong Jhuo 1 and D.T. Lee 1,2
1 Institute of Information Science, Academia Sinica, Taipei, Taiwan
2 Dept. of Computer Science and Engineering, National Chung Hsing University, Taichung, Taiwan
ihjhuo@gmail.com, dtlee@ieee.org

Abstract

Detecting complex video events based on audio and visual modalities is still a largely unresolved issue. Whereas conventional video representation methods extract each modality ineffectively, we propose a regularized multi-modality deep learning method for video event detection. We first build an auto-encoder based on unconstrained minimization and adopt the conjugate gradient method with line search for optimization. The learned auto-encoder captures the relationship between the audio and visual modalities corresponding to the same video event at each layer of the network. To make the network robust to local variance, we apply the commonly used local contrast normalization and spatial maximum pooling to each modality for video representation. Compared with traditional methods using manually designed features, our method is more efficient. Experimental results on publicly available video event detection datasets demonstrate that the proposed method consistently outperforms state-of-the-art video representation methods.

1. Introduction

Automatic detection of complex multimedia events in diverse Internet videos is a fundamental computer vision and multimedia problem, and it is imperative for real-life applications such as video retrieval and surveillance. A large portion of Internet videos are captured by amateur consumers under uncontrolled conditions and are uploaded without any professional post-processing. As a result, video event recognition becomes extremely challenging. Most previous approaches [5, 12] are considerably time consuming because they rely on manually designed features for video representation. Two such examples are the space-time local volume bag-of-words (BoW) based video representation [12] and the visual salient-point BoW representation [16]. Although the features of these two representations have the potential to increase the accuracy of video event recognition, using them effectively requires strong, domain-specific knowledge. Moreover, because of the computational cost, extracting such features for video content analysis is restrictive, especially for large-scale data analysis.

Recently, fusing multiple modalities such as audio and visual information has provided more promising results for video event detection than visual information alone. Research on how to effectively and efficiently combine multiple modalities for a more robust video representation has therefore grown in popularity. Recent pioneering research [4, 8, 9, 24] incorporating both audio and visual modalities for video content analysis has demonstrated its effectiveness and significantly improved performance. Existing deep neural network methods yield excellent results on computer vision tasks [22, 23] of image/audio representation and have many characteristics desirable for large-scale image/audio classification. Rather than manually designing features, deep neural networks provide another way to automatically extract features from raw pixels.
For example, by layer-wise stacking of basic building blocks such as Convolutional Neural Networks (CNN) [14] and Reconstruction Independent Component Analysis (RICA) [13], deep neural networks gradually extract more semantically meaningful features in higher layers. It is worth noting that on the extremely challenging ImageNet classification task [14, 13], deep neural network based methods outperform traditional methods based on manually designed features by a significant margin. Since a deep neural network is a feedforward network at test time for image/audio representation, it not only improves classification results but also reduces the computational cost, and it has the characteristics required for large-scale image/audio classification [14, 15, 23].

In this paper, we propose a single-layer neural network for video representation and event detection.

Figure 1: Framework of our proposed method for video representation. The framework is based on multiple modalities and a deep neural network. First, we simultaneously extract local patches from audio signals and visual keyframes of the same video, and then randomly select patches to learn the auto-encoders, i.e., weights W_a^i and W_v^i, in an unsupervised manner. Once the auto-encoders are learned, spatial maximum pooling and local contrast normalization (LCN), i.e., local subtractive and divisive normalization [17], are applied to each transformed audio/keyframe input, where the input data are represented by the learned weights. The final step of the framework shows the architecture used for each audio and keyframe input in our network. This figure is best viewed in color.

In particular, inspired by the good performance of RICA [13] on ImageNet and motivated by deep neural networks, we propose a regularized multi-modality deep learning algorithm and employ it as the basic building block of a deep neural network. The proposed method simultaneously encodes the relationships between the visual and audio modalities. The resulting video representation outperforms manually designed feature-based video representations in both the accuracy and the efficiency of video event detection.

2. Related Work

In recent years, there have been several interesting works on video content analysis based on fusing multiple modalities. For example, the early fusion strategy [9] averages multiple kernel matrices based on audio and visual modalities before classification. In contrast to early fusion, Natarajan et al. [18] utilized a late fusion strategy that averages the prediction scores of multiple independently trained classifiers for video event detection. Moreover, a joint probability model of audio and visual modalities was developed by Beal et al. [2] for object detection in videos. Jiang et al. [10] proposed Audio-Visual Grouplets (AVGL) by exploring temporal interactions between audio and visual features. Each AVGL is defined as a set of audio and visual codewords, which are grouped together across the audio and visual modalities for video concept classification. Cristani et al. [4] synchronized foreground objects and audio sounds for object detection in videos. Numerous further papers utilize audio and visual information for video content analysis [9, 25]. The methods mentioned above focus on traditional, manually designed features for visual and audio information; handling such features is time consuming, and there is no straightforward way to fuse them.

On the other hand, building blocks of deep neural networks have received increasing attention in computer vision, especially for image representation. These building blocks can be roughly categorized into either the global image representation based building block (GIR2B) strategy or the local image patch based building block (LIP2B) strategy. Complete image-level training, i.e., GIR2B, requires plenty of training samples to train the networks; this strategy is not applicable when the number of training samples is greatly reduced or restricted. GIR2B-based works include the Auto-Encoder (AE), the Restricted Boltzmann Machine (RBM) [6], and extensions of the RBM and AE such as the Stacked Denoising Auto-Encoder [23], the Deep Boltzmann Machine (DBM), and the Contractive Auto-encoder [21].
In contrast to GIR2B, recent studies employing the LIP2B strategy include Reconstruction Independent Component Analysis (RICA) [13], Deconvolutional Networks (DN) [26], and Convolutional Neural Networks (CNN) [15]. These building blocks generally operate at the image patch level; as such, there are sufficient patches to stably train a network. Compared with the GIR2B strategy, the LIP2B strategy is more flexible when dealing with intra-class variance. As a result, the LIP2B strategy usually achieves significant performance improvements on very challenging image classification datasets such as ImageNet, Caltech 101, and Caltech 256.

In addition to the single-modality deep neural networks mentioned above, both Ngiam et al. [19] and Srivastava et al. [22] have recently adopted multi-modality deep neural networks for signal processing tasks. Both of these architectures use the global image representation, which is considerably restrictive for real-life applications. In particular, the two architectures force the hidden states of the multiple modalities to be identical, which neglects the diverse qualities of the different modalities. Motivated by these deep learning works, we develop a multi-modality deep learning algorithm that encodes the relationships between visual and audio information for video event detection.

3. Multi-modality Deep Learning Framework

In this section, we describe our proposed regularized multi-modality deep learning algorithm. The method includes the following modules:

(1) preprocessing the input data with whitening; (2) learning the audio and visual auto-encoders in an unsupervised manner; (3) spatial maximum pooling and local contrast normalization (LCN) [11]; and finally (4) concatenating the learned audio and visual representations into a single video representation. We explain each of these modules in the following subsections.

We first define the variables of our neural network. We use {x_f^i}_{i=1}^n ⊂ R^p to denote unlabeled video keyframe patches and unlabeled audio signal patches. The superscript i indexes the patch, and the subscript f indicates whether the patch is a visual keyframe patch or an audio signal patch: x_v^i denotes a visual keyframe patch and x_a^i an audio signal patch. A visual/audio patch pair with the same superscript i, i.e., x_v^i and x_a^i, is collected from the same video. In our neural network, features are learned from raw pixels. All patches are gathered into the matrix X = [x_v^1, ..., x_v^n, x_a^1, ..., x_a^n], where n denotes the number of visual keyframe/audio signal patches.

3.1. Input Data Preprocessing

Motivated by the success of studies in the deep neural network community, a whitening step is commonly used to de-correlate the input data [14]. We also adopt this preprocessing before learning the auto-encoders from videos. In particular, we normalize each feature patch x_v^i and x_a^i by subtracting the mean of all its entries and then dividing by the standard deviation of all its entries. In our experiments, we found this whitening preprocessing essential for good performance of our neural network based video event detection.

3.2. Unsupervised Feature Learning with Multiple Modalities

We first introduce the Independent Component Analysis (ICA) [7] algorithm for a set of input visual data X_v corresponding to the features of all patches {x_v^i}_{i=1}^n. The goal of the ICA algorithm [7] is to learn the auto-encoders in an unsupervised manner, with the objective function

\min_{W_v} \sum_{i=1}^{n} \sum_{j=1}^{m} \phi(W_v^j x_v^i), \quad \text{s.t. } W_v W_v^\top = I,

where W_v \in \mathbb{R}^{m \times p} is the learned weight matrix (p is the patch dimension and m the number of filters), W_v^j is the jth row of W_v, and \phi is a nonlinear convex function such as the L1 penalty \phi(\cdot) = \log(\cosh(\cdot)) described in [13]. However, this method has difficulty learning overcomplete filters because of the orthogonality constraint W_v W_v^\top = I. In [13], the hard orthogonality constraint of ICA is relaxed to a soft reconstruction cost, which leads to the RICA objective function

\min_{W_v} \frac{1}{n} \sum_{i=1}^{n} \| W_v^\top W_v x_v^i - x_v^i \|_2^2 + \alpha \sum_{i=1}^{n} \sum_{j=1}^{m} \phi(W_v^j x_v^i), \qquad (1)

where W_v serves as both the encoder and the decoder. Given the smooth reconstruction penalty in Eq. (1), which corresponds to the reconstruction cost of a linear auto-encoder [13], the unconstrained problem can handle overcompleteness and the cost function can be minimized efficiently. By taking advantage of additional image information in the classification problem, recent research [3] has used RGB-D images (color images (RGB) with an additional depth channel (D)) to learn three-dimensional features for object recognition. Inspired by the success of these studies, we propose a regularized multi-modality deep learning algorithm to learn visual and audio auto-encoders (also called filters) from unlabeled video event data. This learning reconstruction problem can be formulated with the following objective function:

\min_{W_a, W_v} \sum_{i=1}^{n} \Big( \| W_a^\top W_a x_a^i - x_a^i \|_2^2 + \| W_v^\top W_v x_v^i - x_v^i \|_2^2 + \alpha \big( \| W_a x_a^i \|_2^2 + \| W_v x_v^i \|_2^2 \big) + \beta \| W_v x_v^i - W_a x_a^i \|_2^2 \Big), \qquad (2)

where \alpha and \beta are trade-off parameters, x_a^i is an audio feature patch, and x_v^i is a visual feature patch; the final term couples the audio and visual codes of the same video. To learn the audio and visual encoders W_a and W_v, we adopt an off-the-shelf conjugate gradient algorithm. Similar to RICA, our formulation can be optimized efficiently.
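Eq. (2) is an unconstrained, smooth cost, so it can in principle be minimized with a generic conjugate gradient routine as described above. The following is only a minimal NumPy/SciPy sketch of that idea under the reconstruction of Eq. (2) given here; the synthetic data, the patch and filter sizes, the parameter values, and the use of scipy.optimize.minimize with a numerically approximated gradient are illustrative assumptions rather than the authors' MATLAB implementation.

```python
# Minimal sketch of the Eq. (2) objective (assumed reading: two RICA-style reconstruction
# terms, an activation penalty weighted by alpha, and a cross-modality coupling term
# weighted by beta). All sizes and data below are synthetic and illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
p_a = p_v = 36            # audio / visual patch dimensions (illustrative)
m_a = m_v = 16            # numbers of filters; kept equal so the two codes can be compared
n = 200                   # number of paired patches
alpha, beta = 0.1, 0.1    # trade-off parameters (chosen by validation in the paper)

X_a = rng.standard_normal((p_a, n))
X_v = rng.standard_normal((p_v, n))
# Whitening-style preprocessing used in the paper: per-patch mean/std normalization.
X_a = (X_a - X_a.mean(0)) / (X_a.std(0) + 1e-8)
X_v = (X_v - X_v.mean(0)) / (X_v.std(0) + 1e-8)

def unpack(theta):
    W_a = theta[:m_a * p_a].reshape(m_a, p_a)
    W_v = theta[m_a * p_a:].reshape(m_v, p_v)
    return W_a, W_v

def cost(theta):
    W_a, W_v = unpack(theta)
    H_a, H_v = W_a @ X_a, W_v @ X_v                  # modality codes
    rec_a = W_a.T @ H_a - X_a                        # ||W_a^T W_a x_a - x_a||^2 residual
    rec_v = W_v.T @ H_v - X_v                        # ||W_v^T W_v x_v - x_v||^2 residual
    act = alpha * (np.sum(H_a ** 2) + np.sum(H_v ** 2))
    couple = beta * np.sum((H_v - H_a) ** 2)         # cross-modality correlation term
    return np.sum(rec_a ** 2) + np.sum(rec_v ** 2) + act + couple

theta0 = 0.1 * rng.standard_normal(m_a * p_a + m_v * p_v)
# Conjugate gradient, as in the paper; the gradient here is approximated numerically,
# so the problem is kept tiny -- an analytic gradient would be used in practice.
res = minimize(cost, theta0, method="CG", options={"maxiter": 3})
W_a, W_v = unpack(res.x)
print("final cost:", res.fun)
```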
3.3. Spatial Pooling and Normalization

After the weights W_v and W_a are obtained via our proposed deep learning algorithm, we separately map the visual and audio patches of a video to obtain new visual and audio feature representations. For each new visual/audio representation, we sequentially apply spatial maximum pooling and local contrast normalization (LCN) [11] to preserve the local invariance property. It is worth noting that the local subtractive normalization step of LCN removes the weighted average of the neighboring pixels from the current pixel. For simplicity, we randomly sample visual and audio patches to learn W_v and W_a in our neural network. Once the auto-encoders are learned, we apply them to all visual and audio input data for video representation. In this paper, we use this basic building block (auto-encoder learning, pooling, and LCN) and concatenate the visual and audio features of each video into its final representation.
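For concreteness, the sketch below illustrates these two post-processing sublayers on a single 2-D filter-response map: non-overlapping spatial maximum pooling followed by local subtractive and divisive normalization, in the spirit of [11, 17]. The map layout, pooling window, and Gaussian neighborhood size are assumptions for illustration; the paper does not fix them in this section.

```python
# Sketch of spatial maximum pooling and local contrast normalization on a 2-D response map.
import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_max_pool(fmap, pool=2):
    """Non-overlapping max pooling over pool x pool blocks of a 2-D feature map."""
    h, w = fmap.shape
    h, w = h - h % pool, w - w % pool                     # crop to a multiple of the pool size
    blocks = fmap[:h, :w].reshape(h // pool, pool, w // pool, pool)
    return blocks.max(axis=(1, 3))

def local_contrast_normalize(fmap, sigma=2.0, eps=1e-8):
    """Subtract a Gaussian-weighted local mean, then divide by the local standard deviation."""
    local_mean = gaussian_filter(fmap, sigma)
    centered = fmap - local_mean                          # local subtractive normalization
    local_std = np.sqrt(gaussian_filter(centered ** 2, sigma))
    return centered / np.maximum(local_std, eps)          # local divisive normalization

# Toy usage: one filter-response map of a keyframe.
rng = np.random.default_rng(0)
response = rng.standard_normal((64, 64))
pooled = spatial_max_pool(response, pool=4)
normalized = local_contrast_normalize(pooled)
print(pooled.shape, normalized.shape)
```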

In summary, the auto-encoder learning algorithm, spatial maximum pooling, and LCN constitute the three sublayers of our architecture. Figure 1 illustrates an overview of our framework and the mapping architecture.

4. Experiments

In this section, we test our proposed method on two publicly available video event detection datasets, the TRECVID MED 2011 Development set [18] (MED 2011) and the Columbia Consumer Video (CCV) dataset [8]. We compare our regularized multi-modality deep learning algorithm with the following methods, which achieve state-of-the-art results for video event detection.

(1) Single Visual Feature (SVF). We follow the standard experimental setting and report only the best performance among the visual features mentioned in [24].
(2) Early Fusion (EF). Each video is represented by a high-dimensional vector formed by concatenating SIFT, STIP [12], and MFCC [20] features; the final dimension is 14,000.
(3) Late Fusion (LF). We train one classifier per feature and average the output scores of the classifiers for event detection.
(4) Reconstruction Independent Component Analysis (RICA) [13]. We follow the standard experimental setting [13] and combine the visual and audio modalities as input for unsupervised learning.
(5) Our proposed regularized multi-modality auto-encoder (RMAE) method. In all experiments, we learn the auto-encoders simultaneously from the audio and visual modalities. We use a one-layer neural network on all datasets and report the performance based on the concatenated feature representation.
(6) The state-of-the-art video event detection method in [24], Bi-Modal Bag-of-Words with Maximum Pooling (BMBoW-MP); the size of the bi-modal codebook is set to 4,000 in our experiments.

Baseline Feature Extraction. We follow the bag-of-words feature representation setting in [8]. For visual features, we adopt the sparse keypoint detector Difference of Gaussians [16] to find local keypoints in each keyframe. Each keypoint is described by a 128-dimensional SIFT vector, and the features are quantized into a 5,000-dimensional bag-of-words (BoW) histogram. In addition, we extract MFCC audio features [20] from the video content and quantize them into a 4,000-dimensional BoW histogram. Specifically, we sample one frame per second of video for the visual and audio features.

Experimental Setting. For the visual modality, we first extract one keyframe per second from each video, resize each keyframe to a fixed resolution, and then extract keyframe patches with an overlap step of one pixel, where the step indicates the distance between two neighboring patches. For the audio modality, we extract each audio signal patch from each video with an overlap step of one second during MFCC feature extraction. Overall, we randomly select a total of 800,000 visual and audio patches for unsupervised learning and set the dimensions of W_a and W_v to 100 and 200, respectively. Since the number of keyframes differs between videos, we average all keyframe vectors to obtain a single video representation. In addition, to de-correlate the input data [13], we preprocess the input data before unsupervised learning: each high-dimensional data vector is normalized by subtracting its mean and dividing by its standard deviation.
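As a rough illustration of the patch preparation just described, the sketch below densely extracts overlapping patches from one keyframe with a one-pixel step, normalizes each patch by its own mean and standard deviation, and randomly samples a subset for unsupervised learning. The keyframe resolution, patch size, and sample count are placeholders, since the exact values are not recoverable here.

```python
# Sketch of dense overlapping patch extraction, per-patch normalization, and random sampling.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(keyframe, patch=8, step=1):
    """Return all patch x patch windows of a grayscale keyframe, flattened as columns."""
    windows = sliding_window_view(keyframe, (patch, patch))[::step, ::step]
    return windows.reshape(-1, patch * patch).T           # shape: (patch_dim, num_patches)

def normalize_patches(P, eps=1e-8):
    """Per-patch normalization: subtract the patch mean and divide by its standard deviation."""
    return (P - P.mean(axis=0)) / (P.std(axis=0) + eps)

rng = np.random.default_rng(0)
keyframe = rng.random((96, 128))                           # stand-in for a resized keyframe
P = normalize_patches(extract_patches(keyframe))
sample = P[:, rng.choice(P.shape[1], size=5000, replace=False)]
print(P.shape, sample.shape)
```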
For video event detection, we use a one-vs-all SVM as the classifier and Average Precision (AP) as the evaluation metric. We compute AP for each event and use the Mean Average Precision (MAP) over all events of a dataset as the final measurement. To determine the parameter setting of our method, we vary the values of α and β during unsupervised learning and choose the values that give the best validation performance. For the SVM classifier, we vary the parameter C and choose the best value by five-fold cross validation. We use the χ² kernel for the SVM, computed as k(i, j) = e^{-d_{χ²}(i, j)/η}, where η is set by default to the mean of all pairwise distances in the training set.

Our framework is implemented on the MATLAB platform on a sixteen-core Intel Xeon X4860 processor with a 2.26 GHz CPU and 64 GB of memory, and the unsupervised learning process finishes quickly. For example, on the Columbia Consumer Video dataset, one iteration of computing the auto-encoders takes less than 3 seconds, and the total time needed to learn the auto-encoders is within 3 hours.
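The classification stage described above can be sketched as follows: pairwise χ² distances between the (non-negative) video representations, the kernel k(i, j) = exp(-d_{χ²}(i, j)/η) with η set to the mean pairwise distance, and a one-vs-all SVM trained on the precomputed kernel. The feature dimensionality, labels, and C value below are illustrative, and scikit-learn is used purely for convenience rather than being the authors' setup.

```python
# Sketch of the chi-squared kernel and a one-vs-all SVM on the precomputed kernel matrix.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

def chi2_distances(A, B, eps=1e-10):
    """Pairwise chi-squared distances between rows of A and rows of B (non-negative features)."""
    diff = A[:, None, :] - B[None, :, :]
    summ = A[:, None, :] + B[None, :, :] + eps
    return 0.5 * np.sum(diff ** 2 / summ, axis=2)

rng = np.random.default_rng(0)
X_train = rng.random((120, 50))          # stand-in video representations (e.g., pooled features)
y_train = rng.integers(0, 3, size=120)   # stand-in event labels
X_test = rng.random((30, 50))

D_train = chi2_distances(X_train, X_train)
eta = D_train.mean()                                      # default eta: mean pairwise distance
K_train = np.exp(-D_train / eta)
K_test = np.exp(-chi2_distances(X_test, X_train) / eta)

# One-vs-all SVM on the precomputed chi-squared kernel; C would be chosen by cross validation.
clf = OneVsRestClassifier(SVC(kernel="precomputed", C=1.0))
clf.fit(K_train, y_train)
print(clf.predict(K_test)[:10])
```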

4.1. Experiment on TRECVID MED 2011 Development Dataset

TRECVID Multimedia Event Detection (MED) is a challenging task of detecting complicated high-level events. We first evaluate our proposed method on the TRECVID MED 2011 development dataset. The dataset includes five events (Attempting a board trick, Feeding an animal, Landing a fish, Wedding ceremony, and Working on a woodworking project) and one background class. In total it consists of 10,804 videos spanning 17,566 minutes of web video, partitioned into a training set (8,783 videos) and a test set (2,021 videos). It is worth noting that the training set contains only about 100 positive videos per event.

Figure 2: AP performance comparison of different methods on the TRECVID MED 2011 dataset. This figure is best viewed in color.

The per-event performance of all the compared methods is shown in Figure 2. From the experimental results, we make the following observations. (1) Our proposed RMAE video representation produces better results than all the other baseline methods in terms of MAP, with significant performance improvements on the five events. (2) The RICA and RMAE video representations outperform the single visual feature (SVF), early fusion (EF), and late fusion (LF) methods by a large margin. This is because the neural network based methods encode the useful audio and visual information jointly, while SVF considers only visual information, and both EF and LF simply fuse the audio and visual features/scores in a superficial way without exploring their dependence. (3) All the multi-modality methods, including RMAE, RICA, and the state-of-the-art method, obtain better results than those based on a single feature, which confirms the benefit of multi-modality modeling for video event detection. (4) Comparing RMAE with RICA, our proposed method shows a significant improvement because RMAE captures the relationships between the audio and visual modalities of videos from the same category. For example, the performance gap between RICA and RMAE on Attempting a board trick is larger than on Wedding ceremony. This is because the Attempting a board trick event usually involves simple hitting sounds when the visual content shows the board trick actions, whereas the Wedding ceremony event often involves more complex sounds, such as applause and various kinds of background music, which make event detection more difficult. In addition, our proposed method achieves better results than the BMBoW-MP method. Although the bi-modal bag-of-words concurrently captures both audio and visual information, some isolated audio and visual words remain after the bipartite partitioning algorithm, so the bi-modal BoW representation may incur information loss. In contrast, our proposed method uses the learned audio and visual auto-encoders to represent each video with minimal information loss.

Figure 3: Performance comparison of each single modality on the two benchmarks.

4.2. Experiment on Columbia Consumer Video (CCV) Dataset

Our second experiment uses the Columbia Consumer Video (CCV) benchmark dataset [8]. This dataset includes 9,317 YouTube videos annotated over 20 semantic categories, with 4,659 videos designated for training and the remaining 4,658 videos for testing. The per-category performance of all the compared methods is shown in Figure 4, where we follow the setting in [24] and set the size of the bi-modal codebook to 6,000. From the results, we observe the following. (1) The proposed RMAE method achieves the best performance in terms of MAP.
In particular, it outperforms SVF, LF, and BMBoW-MP by 9.89%, 6.52%, and 1.21%, respectively, which clearly demonstrates that our method is superior to all the baseline feature representations. (2) The RMAE method achieves the best performance on most of the event categories; for instance, on the bird event, our method outperforms the best baseline method, SVF, by 7.61%. (3) Compared with the LF baseline across the object and event concept categories, our method achieves a larger performance gain on the music performance category than on the dog category. The reason may be that the music performance category often contains characteristic background audio, whereas for the dog category the LF method cannot successfully capture the relationships between the modalities; the audio in the dog category includes not only barking but also environmental noise, which is hard to distinguish for classification. (4) In most event/scene categories, i.e., excluding the biking, bird, and parade categories, the proposed method achieves better results than the RICA method, which demonstrates that the final term in Eq. (2) exploits the correlations between the audio and visual modalities.

Figure 4: Per-category performance comparison on the CCV dataset.

In general, we expect a high performance impact from the proposed RMAE method because it automatically captures the relationships between multiple modalities. Figure 3 compares the performance of each single modality on the two datasets. As can be seen, the proposed RMAE method performs better than either individual modality alone. This confirms the conclusion of [24] that including the audio modality can significantly improve video event detection performance. Moreover, the audio modality suffers less information loss on the CCV dataset than on the TRECVID MED 2011 dataset, since distinguishable audio patterns can be extracted for certain categories such as the music performance event. Comparing SVF and RMAE, the proposed method attains better results than the SIFT BoW feature extraction, which makes it a more robust visual representation for video event detection.

5. Conclusion

In this work, we have introduced a regularized multi-modality deep learning algorithm to improve conventional feature extraction for video event detection. The proposed method not only learns the audio and visual auto-encoders in an unsupervised manner but also encodes the relationships between the audio and visual modalities of the same video with minimal information loss. In addition, with the help of spatial maximum pooling and local contrast normalization, the features learned by our neural network are robust to local variance, i.e., adopting LCN and pooling improves video event detection. Extensive experiments on two public video event benchmarks consistently show that the proposed method significantly outperforms manually designed features and fusion methods, and that it achieves the best performance compared with the RICA-based deep neural network. These promising results show the effectiveness of our proposed multi-modality deep learning for video event detection.

References

[1]
[2] M. Beal, N. Jojic, and H. Attias. A Graphical Model for Audiovisual Object Tracking. IEEE TPAMI.
[3] M. Blum, J. Springenberg, J. Wülfing, and M. Riedmiller. A Learned Feature Descriptor for Object Recognition in RGB-D Data. In ICRA '12.
[4] M. Cristani, M. Bicego, and V. Murino. Audio-visual Event Recognition in Surveillance Video Sequences. IEEE TMM.
[5] A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik. Recognizing Objects in Range Data Using Regional Point Descriptors. In ECCV '04.
[6] G. E. Hinton, S. Osindero, and Y.-W. Teh. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18(7).
[7] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley Interscience.
[8] Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. Loui. Consumer Video Understanding: A Benchmark Database and an Evaluation of Human and Machine Performance. In ICMR.
[9] Y.-G. Jiang, S. Bhattacharya, S.-F. Chang, and M. Shah. High-Level Event Recognition in Unconstrained Videos. IJMIR.
[10] W. Jiang and A. Loui. Audio-visual Grouplet: Temporal Audio-visual Interactions for General Video Concept Classification. In ACM MM.
[11] K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun. What is the Best Multi-stage Architecture for Object Recognition? In ICCV '09.
[12] I. Laptev and T. Lindeberg. On Space-time Interest Points. IJCV.
[13] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. In NIPS '11.
[14] Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, and A. Y. Ng. Tiled Convolutional Neural Networks. In NIPS '10.
[15] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4).
[16] D. Lowe. Distinctive Image Features from Scale-invariant Keypoints. IJCV.
[17] S. Lyu and E. Simoncelli. Nonlinear Image Representation Using Divisive Normalization. In ICCV '09.
[18] P. Natarajan, V. Manohar, S. Wu, S. Tsakalidis, S. N. Vitaladevuni, X. Zhuang, R. Prasad, G. Ye, D. Liu, I.-H. Jhuo, S.-F. Chang, H. Izadinia, I. Saleemi, and M. Shah. BBN VISER TRECVID 2011 Multimedia Event Detection System. In NIST TRECVID Workshop.
[19] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal Deep Learning. In ICML '11.
[20] L. Pols. Spectral Analysis and Identification of Dutch Vowels in Monosyllabic Words. Doctoral dissertation.
[21] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive Auto-encoders: Explicit Invariance during Feature Extraction. In ICML '11.
[22] N. Srivastava and R. Salakhutdinov. Multimodal Learning with Deep Boltzmann Machines. In NIPS '12.
[23] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. JMLR, 11(5), 2010.
[24] G. Ye, I.-H. Jhuo, D. Liu, Y.-G. Jiang, D. T. Lee, and S.-F. Chang. Joint Audio-Visual Bi-Modal Codewords for Video Event Detection. In ICMR '12.
[25] J.-C. Wang, Y.-H. Yang, I.-H. Jhuo, Y.-Y. Lin, and H.-M. Wang. The Acoustic-Visual Emotion Gaussians Model for Automatic Generation of Music Video. In ACM MM.
[26] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus. Deconvolutional Networks. In CVPR.


More information

Multi-Glance Attention Models For Image Classification

Multi-Glance Attention Models For Image Classification Multi-Glance Attention Models For Image Classification Chinmay Duvedi Stanford University Stanford, CA cduvedi@stanford.edu Pararth Shah Stanford University Stanford, CA pararth@stanford.edu Abstract We

More information

ImageCLEF 2011

ImageCLEF 2011 SZTAKI @ ImageCLEF 2011 Bálint Daróczy joint work with András Benczúr, Róbert Pethes Data Mining and Web Search Group Computer and Automation Research Institute Hungarian Academy of Sciences Training/test

More information

Depth Image Dimension Reduction Using Deep Belief Networks

Depth Image Dimension Reduction Using Deep Belief Networks Depth Image Dimension Reduction Using Deep Belief Networks Isma Hadji* and Akshay Jain** Department of Electrical and Computer Engineering University of Missouri 19 Eng. Building West, Columbia, MO, 65211

More information

Content-Based Image Retrieval Using Deep Belief Networks

Content-Based Image Retrieval Using Deep Belief Networks Content-Based Image Retrieval Using Deep Belief Networks By Jason Kroge Submitted to the graduate degree program in the Department of Electrical Engineering and Computer Science of the University of Kansas

More information

Supplementary Material for Ensemble Diffusion for Retrieval

Supplementary Material for Ensemble Diffusion for Retrieval Supplementary Material for Ensemble Diffusion for Retrieval Song Bai 1, Zhichao Zhou 1, Jingdong Wang, Xiang Bai 1, Longin Jan Latecki 3, Qi Tian 4 1 Huazhong University of Science and Technology, Microsoft

More information

Latest development in image feature representation and extraction

Latest development in image feature representation and extraction International Journal of Advanced Research and Development ISSN: 2455-4030, Impact Factor: RJIF 5.24 www.advancedjournal.com Volume 2; Issue 1; January 2017; Page No. 05-09 Latest development in image

More information

Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search

Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search Lu Jiang 1, Deyu Meng 2, Teruko Mitamura 1, Alexander G. Hauptmann 1 1 School of Computer Science, Carnegie Mellon University

More information

KBSVM: KMeans-based SVM for Business Intelligence

KBSVM: KMeans-based SVM for Business Intelligence Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2004 Proceedings Americas Conference on Information Systems (AMCIS) December 2004 KBSVM: KMeans-based SVM for Business Intelligence

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

arxiv: v1 [cs.cv] 4 Feb 2018

arxiv: v1 [cs.cv] 4 Feb 2018 End2You The Imperial Toolkit for Multimodal Profiling by End-to-End Learning arxiv:1802.01115v1 [cs.cv] 4 Feb 2018 Panagiotis Tzirakis Stefanos Zafeiriou Björn W. Schuller Department of Computing Imperial

More information

The SIFT (Scale Invariant Feature

The SIFT (Scale Invariant Feature The SIFT (Scale Invariant Feature Transform) Detector and Descriptor developed by David Lowe University of British Columbia Initial paper ICCV 1999 Newer journal paper IJCV 2004 Review: Matt Brown s Canonical

More information

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech Convolutional Neural Networks Computer Vision Jia-Bin Huang, Virginia Tech Today s class Overview Convolutional Neural Network (CNN) Training CNN Understanding and Visualizing CNN Image Categorization:

More information

III. VERVIEW OF THE METHODS

III. VERVIEW OF THE METHODS An Analytical Study of SIFT and SURF in Image Registration Vivek Kumar Gupta, Kanchan Cecil Department of Electronics & Telecommunication, Jabalpur engineering college, Jabalpur, India comparing the distance

More information