Evaluation of Triple-Stream Convolutional Networks for Action Recognition


Dichao Liu, Yu Wang and Jien Kato
Graduate School of Informatics, Nagoya University, Nagoya, Japan
{liu, ywang, jien} (at) mv.ss.is.nagoya-u.ac.jp

Abstract: Recently, two-stream convolutional networks have achieved remarkable performance. In particular, by capturing both appearance and motion information, spatial-temporal two-stream networks bring noticeable improvement. On the other hand, the dynamic image, a powerful representation for videos, has also been confirmed to provide information complementary to spatial appearance. Inspired by these works, we propose Triple-Stream Convolutional Networks, which fuse in a third network stream whose input is the dynamic image. In this paper, we implement the proposed Triple-Stream Convolutional Networks and evaluate them in two respects: (a) how much the overall end-to-end classification performance benefits from adding the dynamic stream; (b) which way of using the trained Triple-Stream Convolutional Networks for classification is most effective. Our evaluation shows improvements over both single networks (spatial and temporal) and the fused spatial-temporal two-stream network.

I. INTRODUCTION

Action recognition in videos is a basic step of many promising applications. It is usually approached by two kinds of methods. One is the shallow approach, which uses statistics of hand-crafted local features to capture the spatial-temporal characteristics of actions [1], [2], [3]. The other is the deep approach, which uses deep models to learn representations of actions in an end-to-end manner [4], [5], [6], [3], [7], [8], [9]. Although shallow methods perform well, more and more recent works show the power of deep methods. The two-stream convolutional network [10] is a well-known example. It trains two networks, a spatial network and a temporal network. The spatial network is fed with static frames to obtain spatial clues about object appearance, while the temporal network is fed with optical flow to obtain temporal clues about object motion. These two kinds of clues are combined by fusing the two networks [10]. This model is very successful because it is able to recognize both the objects and the motion. Much later research builds on the two-stream network: some works develop video representations based on it [11], [12], while others focus on end-to-end improvements over it [5], [9]. In particular, in [5] the spatial and temporal networks are fused and trained together, achieving state-of-the-art performance.

Inspired by the above work on two-stream networks, we propose the triple-stream network architecture. A triple-stream network is an extension of the two-stream network in which a third network processes dynamic image inputs. The dynamic image [13] is a compact yet powerful video representation based on rank pooling [14], [15]. An earlier work [16] confirmed that dynamic images are effective for end-to-end learning of action categories. As shown in Figure 1, dynamic images intuitively look like panning or motion blur, containing the temporal evolution of a video in a single static image. Traditional two-stream convolutional networks use static frames and optical flows (as shown in Figure 1) as inputs. We hypothesize that, as an intermediary feature containing both spatial and temporal information, the dynamic image can help the model better associate spatial and temporal information.
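To make the dynamic image representation concrete, the following is a minimal sketch (our own illustration, not the authors' implementation) of how a dynamic image can be obtained from a window of frames with approximate rank pooling; the coefficient formula is the one commonly attributed to [13], and the function name and the final rescaling to an 8-bit image are assumptions made for the example.

import numpy as np

def dynamic_image(frames):
    # frames: list of H x W x 3 float arrays covering one temporal window
    T = len(frames)
    # harmonic numbers H_0 .. H_T
    H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    # alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), t = 1..T
    alphas = [2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1]) for t in range(1, T + 1)]
    d = sum(a * f for a, f in zip(alphas, frames))
    # rescale so the result can be fed to a CNN like an ordinary RGB image
    d = (d - d.min()) / (d.max() - d.min() + 1e-8) * 255.0
    return d.astype(np.uint8)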
In this paper, our main purpose is to evaluate how classification performance can be improved by adding the dynamic image network and to find out how to make efficient use of the triple-stream architecture.

Fig. 1: Example of a static frame (a), a dynamic image (b) and a stack of optical flows (c). We regard the dynamic image as an intermediary feature between appearance and motion and thus hypothesize that it can help improve performance.

II. EVALUATION METHODS

We use three fusion methods to construct, train and utilize fused networks, namely convolutional fusion, score fusion and feature fusion, as shown in Figure 2. We first evaluate the performance improvement brought by fusing the dynamic image stream into single spatial and temporal networks. We separately train single spatial, temporal and dynamic image networks, and then fuse the dynamic image network into the single spatial and temporal networks with each of the three methods. After that, we implement triple-stream networks with each of the three methods and evaluate their performance.

Fig. 2: Illustration of the process of convolutional fusion (a), score fusion (b) and feature fusion (c).

After the fusion, we keep all the network towers and use all of them for training and prediction. The reasons are: (a) when training the networks, keeping all towers generally helps the whole model optimize towards a target from which every stream benefits; (b) when using the networks for classification, the deep features or prediction scores from different towers complement each other.

A. Fusion methods

Convolutional fusion (CF) [5], [17] fuses different networks together and then trains and utilizes the fused networks jointly. CF has been shown to be effective for fusing networks with RGB frame inputs and optical flow inputs [5]. The advantage of this method is that it can exploit the correspondence of activations at the same pixel position. Using this method, [5] fuses appearance and motion feature maps correspondingly and thus achieves a remarkable improvement. However, as we observe, when more networks are fused together, shortcomings may also emerge; in particular, the fused networks tend to become harder to optimize. We use CF to fuse the ReLU5 feature maps from the different networks. As shown in Figure 2(a), CF concatenates the feature maps from the different networks and then convolves the concatenated activations. Similar to the convolutional fusion layer for two networks constructed in [5], we construct a convolutional fusion layer for more networks. In equation (1), x_1, x_2, ..., x_n are the D-channel feature maps and the operator cat denotes concatenation. The concatenated feature maps are fused by a filter f in R^{1 x 1 x nD x D} and a bias b in R^D. During training, f fuses the concatenated feature maps correspondingly while reducing the dimensionality back to D channels. The fusion is weighted, and the weights are adjusted by minimizing the joint loss when the whole network is trained (a minimal sketch of this layer is given below).

    y = cat(x_1, x_2, ..., x_n) * f + b    (1)
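The sketch below is our own PyTorch-style illustration of the fusion layer in equation (1), not the authors' code; the class name and the choice of D = 512 (the ReLU5 channel count of VGG-16) are assumptions made for the example.

import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    def __init__(self, n_streams, channels):
        super().__init__()
        # 1x1 convolution f in R^{1 x 1 x nD x D} plus bias b in R^D: it fuses
        # corresponding activations across streams and reduces nD channels back to D.
        self.fuse = nn.Conv2d(n_streams * channels, channels, kernel_size=1, bias=True)

    def forward(self, feature_maps):
        # feature_maps: list of n tensors, each of shape (batch, D, H, W)
        x = torch.cat(feature_maps, dim=1)   # cat(x_1, x_2, ..., x_n)
        return self.fuse(x)                  # y = cat(...) * f + b

# e.g. fusing ReLU5 maps from the spatial, temporal and dynamic image streams
fusion = ConvFusion(n_streams=3, channels=512)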
Score fusion (SF), as shown in Figure 2(b), first trains the networks separately. Then all the parameters are fixed and the networks are fused at the prediction layer by average fusion. In our case, we average the activations of the fc8 layer to obtain the final prediction scores. The advantage of this method is that separately trained networks are less complicated and thus easier to optimize. The disadvantage is the lack of correspondence between feature maps from different network streams.

Feature fusion (FF) extracts CNN activations and fuses the activations from the different streams into video representations. As shown in Figure 2(c), we extract the ReLU7 activations of each stream as that stream's video representation. The representations are aggregated via average pooling and followed by l2 normalization. After that, data whitening is applied to the averaged representations to reduce dimensional dependency. Finally, we apply l2 normalization again and obtain a deep-learned representation for each video.

B. Network fusion

To evaluate how much recognition performance benefits from fusing in the dynamic image stream, we use the above fusion methods to fuse the dynamic image network into two types of networks: single spatial or temporal networks, and the spatial-temporal two-stream network [5].

Firstly, we fuse the dynamic image stream with single networks and compare the performance. As shown in Figure 3, we separately train single spatial, temporal and dynamic image networks. Then we fuse the dynamic image network into the single spatial and temporal networks using CF, SF and FF, and evaluate whether and how much performance benefits from adding the dynamic image stream under each of the three fusion methods.

Then, as shown in Figure 4, we separately train a single dynamic image network and a fused spatial-temporal two-stream network rather than training three single networks separately. We construct the triple-stream networks by fusing the dynamic image stream into the spatial-temporal two-stream network with each of the three fusion methods. For CF, we fuse the dynamic image stream into the spatial-temporal two-stream network at the same fusion layer where the spatial and temporal streams are fused.

For FF, we fuse the ReLU7 activations from all three streams together into the video representation by the process described in Section II-A. For SF, however, since [13] shows that dynamic images are complementary to spatial appearance to some extent, we average the prediction scores as in equation (2), rather than directly averaging all scores together, in order to reduce the performance uncertainty of our evaluation.

    S = avg(avg(S_sp, S_dy), S_te)    (2)

In equation (2), S is the final prediction score, S_sp is the prediction score of the spatial stream, S_dy is the prediction score of the dynamic image stream, S_te is the prediction score of the temporal stream, and the operator avg denotes averaging.

Fig. 4: Triple-stream networks constructed by fusing the dynamic image stream using CF (a), SF (b) or FF (c). As in Fig. 3, conv represents layer groups. Triple-stream CF and FF are implemented in the same way as two-stream CF and FF. For triple-stream SF, we first average the prediction scores of the spatial stream and the dynamic image stream, and then average the result with the prediction scores of the temporal stream.

Fig. 3: Two-stream networks constructed by fusing the dynamic image stream using CF (a), SF (b) or FF (c). In this figure, conv represents layer groups; for example, conv5 represents the layers conv5_1...conv5_3 in VGG-16.

III. EXPERIMENT AND EVALUATION

We evaluate on the UCF101 dataset [18], a popular dataset consisting of video clips in 101 categories. We extract RGB frames and compute optical flow from all the frames of each video. Multiple dynamic images are then generated from each video, one for every L_dy = 10 frames. During training, the dynamic image stream is fed with dynamic images, while the spatial and temporal streams are respectively fed with RGB frames and optical flow stacks of L_te = 10 frames. In our experiments, we first evaluate on the first split of UCF101, and the better-performing methods are then evaluated on all three splits to test the consistency of the performance.

A. Implementation details

Details of network training: All the networks in this paper are based on the very deep VGG-16 model [19], which has 16 weight layers in total (13 convolutional and 3 fully-connected layers). When training or fine-tuning networks, the batch size is set to 96. To save memory, instead of computing the gradients of a whole batch in one go, we split every batch into smaller sub-batches and accumulate the gradients by processing them sequentially. We apply data augmentation (random cropping and flipping) to the inputs of all networks. Following the implementation in [5], the learning rate of every network starts from 10^-3 and is reduced to 10% of its value whenever the training accuracy saturates, until it reaches its final value. We apply dropout to the first two fully-connected layers of every stream. When training single networks, the dropout ratio of both the spatial and the dynamic image network is set to 0.85, while that of the temporal network is set to 0.9. When training the two-stream or triple-stream networks, all dropout ratios are set to a single common value. For validation, we sample T = 5 frames per video (T static frames, T stacks of optical flows and T dynamic images) together with their horizontal flips, without cropping.

Details of FF construction: To construct the deep-learned representations, we also sample T = 5 frames with their horizontal flips, without cropping, for each video. These frames are fed into the trained networks and the 4096-D ReLU7 features are extracted by the process described in Section II-A.
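To make the two test-time combination schemes explicit, the following is a minimal numpy sketch (our own illustration under assumed names): the first function is the triple-stream score fusion of equation (2); the second builds the FF deep-learned representation, where the whitening transform is assumed to be fitted separately on training-set representations.

import numpy as np

def triple_stream_score_fusion(s_sp, s_dy, s_te):
    # Eq. (2): average spatial and dynamic image scores first, then average with temporal.
    return ((s_sp + s_dy) / 2.0 + s_te) / 2.0

def l2_normalize(x):
    return x / (np.linalg.norm(x) + 1e-12)

def feature_fusion_representation(relu7_feats, whiten):
    # relu7_feats: list of 4096-D ReLU7 activations (T sampled frames and their
    # flips, collected from all streams); whiten: a fitted whitening transform.
    rep = l2_normalize(np.mean(relu7_feats, axis=0))  # average pooling + l2 normalization
    rep = whiten(rep)                                 # data whitening
    return l2_normalize(rep)                          # final l2 normalization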
B. Performance of fusing the dynamic image network into single networks

First, we evaluate the effects of fusing the dynamic image network with the single spatial and temporal networks. The results are shown in Table I.

TABLE I: Effect of fusing the dynamic image network into single spatial or temporal networks (accuracy on UCF101 split 1)

Method                     | Single network | CF | SF     | FF / deep feature of single stream
dynamic image              | 79.36%         |    |        |
spatial                    | 82.13%         |    |        |
spatial + dynamic image    |                |    | 85.62% | 84.46%
temporal                   | 85.04%         |    |        |
temporal + dynamic image   |                |    | 88.30% | 89.32%

It can be observed that, when the spatial and dynamic image networks are fused by SF, performance improves remarkably. A similar improvement has also been seen in the implementation of [18]. Besides, we show that fusing the temporal and dynamic image networks using SF also improves performance over the single temporal network. FF likewise performs much better than the single networks. Note that the deep-learned representation of a single network performs similarly to its end-to-end counterpart; it can therefore be asserted that the performance improvement comes from the addition of the dynamic image network rather than from the representation process itself. However, when the networks are fused by CF, performance decreases for both input combinations. Our results show that the dynamic image network is not suitable for being fused and trained together with the spatial or temporal network by CF. We discuss a possible reason in Section III-D.

C. Performance of triple-stream networks

Next, we compare the performance of the triple-stream networks with that of spatial-temporal two-stream CF [5]. As shown in Table II, both triple-stream SF and triple-stream FF perform better than spatial-temporal two-stream fusion, whereas triple-stream CF performs slightly worse than spatial-temporal two-stream CF. We discuss a possible reason in Section III-D.

TABLE II: Comparison between spatial-temporal two-stream CF and triple-stream networks on UCF101 split 1

Method                          | Accuracy
Spatial-temporal two-stream CF  | 90.70%
Triple-stream CF                | 90.30%
Triple-stream SF                | 91.03%
Triple-stream FF                | 90.87%

Finally, we evaluate the better-performing triple-stream networks (SF and FF) on all three splits of UCF101. As shown in Table III, both methods perform better than two-stream CF on all three splits, which shows that they gain a consistent improvement over two-stream CF.

TABLE III: Comparison between spatial-temporal two-stream CF and triple-stream networks on the three UCF101 splits

Method                          | Split 1 | Split 2 | Split 3 | Average
Spatial-temporal two-stream CF  | 90.70%  | 90.80%  | 91.59%  | 91.03%
Triple-stream SF                | 91.03%  | 91.67%  | 92.53%  | 91.74%
Triple-stream FF                | 90.87%  | 91.38%  | 92.67%  | 91.64%

D. Discussion of the unsatisfactory performance of fusing the dynamic image network by CF

We observe that the networks become much harder to optimize when the dynamic image network is fused by CF. For example, Table IV shows the training statuses of spatial-temporal two-stream CF and triple-stream CF at saturation: both the training error and the training loss of the spatial and temporal streams of the latter were much higher than those of the former. As for the possible reason for this optimization difficulty, we believe it lies in the redundancy of the information provided by the dynamic image and spatial streams, and by the dynamic image and temporal streams. To confirm this, we randomly select 200 video frames with their corresponding dynamic images and stacks of optical flows from 200 different videos. We then feed them into the triple-stream network and extract the ReLU5 feature maps, which are the feature maps we concatenate for CF.
We then flatten these feature maps into vectors, which we refer to as R5Vs. We compute the average correlation coefficient among R5Vs from different streams; the results are shown in Table V. Compared with the correlation between the R5Vs of the spatial and temporal streams, the correlations between the R5Vs of the spatial and dynamic image streams, and between those of the temporal and dynamic image streams, are much higher. We further illustrate this situation in Figure 5 by projecting the R5Vs onto a two-dimensional plane. It is known that high redundancy and strong correlation in the data often lead to optimization difficulty and thus to the unsatisfactory performance of many machine learning algorithms [20], [21], [22]. The R5Vs are in fact the inputs of the following fully-connected layers, so their redundancy can harm the classification ability of the fully-connected layers. By contrast, in SF the prediction scores are already the results of optimization, so the negative influence of redundancy is limited. Likewise, in FF, because data whitening is applied, the correlation among the data is significantly reduced even though the dimensionality is kept unchanged. Thus both FF and SF can benefit from fusing the dynamic image stream. Our future goal is to find ways to reduce the correlation and redundancy in the concatenated feature maps after the dynamic image stream is fused in an end-to-end manner.
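One plausible way to carry out the redundancy check described above is sketched here (our own numpy illustration; the per-sample pairing and the function name are assumptions, since the paper does not spell out the exact computation).

import numpy as np

def average_correlation(r5vs_a, r5vs_b):
    # r5vs_a, r5vs_b: arrays of shape (num_samples, dim); each row is a flattened
    # ReLU5 feature map (an "R5V") from one stream for one sampled frame.
    corrs = [np.corrcoef(a, b)[0, 1] for a, b in zip(r5vs_a, r5vs_b)]
    return float(np.mean(corrs))

# e.g. average_correlation(spatial_r5vs, dynamic_r5vs) vs.
#      average_correlation(spatial_r5vs, temporal_r5vs)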

TABLE IV: Comparison of final training statuses between spatial-temporal two-stream CF and triple-stream CF

Networks                        | Stream          | Error | Loss
Spatial-temporal two-stream CF  | Spatial stream  |       |
                                | Temporal stream |       |
Triple-stream CF                | Spatial stream  |       |
                                | Temporal stream |       |

TABLE V: Average correlation coefficient among R5Vs from different streams

Stream pair          | Spatial-temporal | Spatial-dynamic | Temporal-dynamic
Average correlation  |                  |                 |

Fig. 5: Illustration of two-dimensional points projected from the R5Vs of the spatial (blue rhombuses), temporal (red squares) and dynamic image (green triangles) streams. Even though some blue rhombuses mix with red squares, overall most of the blue rhombuses assemble on the left while the red squares assemble on the right. In contrast to this small mixture of red squares and blue rhombuses, the green triangles mix heavily with both of the other two groups. It is thus apparent that there is high redundancy between the feature maps of the dynamic image stream and those of the two other streams.

IV. CONCLUSION

In this work, we first evaluated the effect of fusing the dynamic image network into a single spatial or temporal network. Except when the networks are fused by CF, all methods gain a noticeable improvement over the single networks. We then evaluated the triple-stream networks. Again except for CF, the other two triple-stream fusion methods gain a consistent improvement over two-stream fusion. Our results show that, with suitable fusion methods, the addition of the dynamic image stream helps both single networks and the fused spatial-temporal two-stream network better recognize human actions in videos. Regarding the possible reason for the unsatisfactory performance of fusing the dynamic image network by CF, we suppose it is the redundancy of information between the dynamic image and spatial streams and between the dynamic image and temporal streams, which makes the networks more difficult to optimize. We plan to further investigate and verify this reason and to come up with solutions that achieve better performance.

ACKNOWLEDGEMENT

This research is supported by the JSPS Grant-in-Aid for Challenging Exploratory Research (No. 16K12460), the JSPS Grant-in-Aid for Young Scientists (B) (No. 17K12714), the JST Center of Innovation Program and the PhD Program Toryumon of Nagoya University.

REFERENCES

[1] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the IEEE International Conference on Computer Vision, 2013.
[2] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1.
[3] L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi, "Human action recognition using factorized spatio-temporal convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[4] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[5] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[6] N. Ballas, L. Yao, C. Pal, and A. Courville, "Delving deeper into convolutional networks for learning video representations," arXiv preprint.
[7] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, "Joint fine-tuning in deep neural networks for facial expression recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[9] C. Feichtenhofer, A. Pinz, and R. P. Wildes, "Spatiotemporal residual networks for video action recognition," CoRR.
[10] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014.
[11] L. Wang, Y. Qiao, and X. Tang, "Action recognition with trajectory-pooled deep-convolutional descriptors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[12] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. C. Russell, "ActionVLAD: Learning spatio-temporal aggregation for action classification," CoRR.
[13] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, "Dynamic image networks for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[14] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars, "Rank pooling for action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4.
[15] B. Fernando and S. Gould, "Learning end-to-end video classification with rank-pooling," in International Conference on Machine Learning, 2016.
[16] M. S. Aliakbarian, F. Saleh, B. Fernando, M. Salzmann, L. Petersson, and L. Andersson, "Deep action- and context-aware sequence learning for activity recognition and anticipation," arXiv preprint.
[17] C. Xiong, L. Liu, X. Zhao, S. Yan, and T.-K. Kim, "Convolutional fusion network for face verification in the wild," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 3.

[18] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human action classes from videos in the wild," arXiv preprint.
[19] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint.
[20] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5.
[21] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, "Sparse convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[22] Z. Pan, A. G. Rust, and H. Bolouri, "Image redundancy reduction for neural network classification using discrete cosine transforms," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. 3, 2000.
