Layerwise Interweaving Convolutional LSTM

Similar documents
Image Captioning with Object Detection and Localization

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks

Domain-Aware Sentiment Classification with GRUs and CNNs

End-To-End Spam Classification With Neural Networks

Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia

A Torch Library for Action Recognition and Detection Using CNNs and LSTMs

Empirical Evaluation of RNN Architectures on Sentence Classification Task

Deep Learning with Tensorflow AlexNet

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

Real-time Object Detection CS 229 Course Project

3D Densely Convolutional Networks for Volumetric Segmentation. Toan Duc Bui, Jitae Shin, and Taesup Moon

Supplementary material for Analyzing Filters Toward Efficient ConvNet

arxiv: v1 [cs.lg] 20 Apr 2018

Channel Locality Block: A Variant of Squeeze-and-Excitation

Multimodal Gesture Recognition using Multi-stream Recurrent Neural Network

arxiv: v1 [cs.cv] 6 Jul 2016

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Rotation Invariance Neural Network

Densely Connected Bidirectional LSTM with Applications to Sentence Classification

A Novel Representation and Pipeline for Object Detection

Real-time detection and tracking of pedestrians in CCTV images using a deep convolutional neural network

Computer Vision Lecture 16

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY

Computer Vision Lecture 16

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Face Recognition A Deep Learning Approach

A Deep Learning primer

Groupout: A Way to Regularize Deep Convolutional Neural Network

Computer Vision Lecture 16

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic

Machine Learning. MGS Lecture 3: Deep Learning

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

Multi-Glance Attention Models For Image Classification

Recurrent Neural Networks

arxiv: v1 [cs.cv] 20 Dec 2016

Comparison of Default Patient Surface Model Estimation Methods

A Novel Weight-Shared Multi-Stage Network Architecture of CNNs for Scale Invariance

An Exploration of Computer Vision Techniques for Bird Species Classification

Su et al. Shape Descriptors - III

27: Hybrid Graphical Models and Neural Networks

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

Machine Learning 13. week

CS839: Probabilistic Graphical Models. Lecture 22: The Attention Mechanism. Theo Rekatsinas

3D CONVOLUTIONAL NEURAL NETWORK WITH MULTI-MODEL FRAMEWORK FOR ACTION RECOGNITION

CSC 578 Neural Networks and Deep Learning

Convolutional Neural Networks

Deep Neural Networks:

Outline GF-RNN ReNet. Outline

Transition-Based Dependency Parsing with Stack Long Short-Term Memory

Contextual Dropout. Sam Fok. Abstract. 1. Introduction. 2. Background and Related Work

Convolutional Sequence to Sequence Learning. Denis Yarats with Jonas Gehring, Michael Auli, David Grangier, Yann Dauphin Facebook AI Research

Sentiment Classification of Food Reviews

A Deep Learning Framework for Authorship Classification of Paintings

International Journal of Computer Engineering and Applications, Volume XII, Special Issue, September 18,

Content-Based Image Recovery

Volumetric and Multi-View CNNs for Object Classification on 3D Data Supplementary Material

Modeling Sequences Conditioned on Context with RNNs

Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neural Networks

Ryerson University CP8208. Soft Computing and Machine Intelligence. Naive Road-Detection using CNNS. Authors: Sarah Asiri - Domenic Curro

ImageNet Classification with Deep Convolutional Neural Networks

Advanced Video Analysis & Imaging

Deep Learning based Authorship Identification

CNN-based Human Body Orientation Estimation for Robotic Attendant

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015

SIIM 2017 Scientific Session Analytics & Deep Learning Part 2 Friday, June 2 8:00 am 9:30 am

Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling

Part Localization by Exploiting Deep Convolutional Networks

How to Develop Encoder-Decoder LSTMs

SocialML: machine learning for social media video creators

Aggregating Frame-level Features for Large-Scale Video Classification

Code Mania Artificial Intelligence: a. Module - 1: Introduction to Artificial intelligence and Python:

Deep MIML Network. Abstract

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Neural Network Models for Text Classification. Hongwei Wang 18/11/2016

Cross-domain Deep Encoding for 3D Voxels and 2D Images

Deep Learning Explained Module 4: Convolution Neural Networks (CNN or Conv Nets)

Real-time object detection towards high power efficiency

Deep Learning and Its Applications

Scaling Deep Learning on Multiple In-Memory Processors

Structured Attention Networks

Representation Learning using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval

Stacked Denoising Autoencoders for Face Pose Normalization

ABC-CNN: Attention Based CNN for Visual Question Answering

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Convolutional Sequence to Sequence Non-intrusive Load Monitoring

Mocha.jl. Deep Learning in Julia. Chiyuan Zhang CSAIL, MIT

Recurrent Neural Networks and Transfer Learning for Action Recognition

END-TO-END CHINESE TEXT RECOGNITION

arxiv: v1 [cs.cv] 14 Jul 2017

Perceptron: This is convolution!

Transcription:

Layerwise Interweaving Convolutional LSTM Tiehang Duan and Sargur N. Srihari Department of Computer Science and Engineering The State University of New York at Buffalo Buffalo, NY 14260, United States tiehangd@buffalo.edu, srihari@cedar.buffalo.edu Abstract. A deep network structure is formed with LSTM layer and convolutional layer interweaves with each other. The Layerwise Interweaving Convolutional LSTM(LIC-LSTM) enhanced the feature extraction ability of LSTM stack and is capable for versatile sequential data modeling. Its unique network structure allows it to extract higher level features with sequential information involved. Experiment results show the model achieves higher accuracy and shoulders lower perplexity on sequential data modeling tasks compared with state of art LSTM models. 1 Introduction LSTM is a specific Recurrent Neural Network(RNN) which uses explicit gates to control the magnitude of input, cell state and hidden state at each time step and it has the unique property of maintaining long term dependencies in the memory cell [6]. It is playing an important role in sequential learning and has been applied to multiple fields of natural language processing including sentiment analysis [8], semantic analysis [7] and neural machine translation [1] [5] etc. The feature extraction of LSTM can be enhanced with windowed convolution operation, which is used widely in computer vision and image processing [4]. Researchers have contributed innovative and pioneering work for possible combinations of LSTM and CNN. Y. Kim et al. built the CNN-LSTM character level model for NLP problems [3]; S. Vosoughi et al. combined CNN and LSTM model to encode tweets into vectors for semantic and sentiment analysis [7]; and J. Donahue et al. used similar structure for visual recognition and description [2]. Previous works adopt the network structure with convolutional layer(or several convolutional layers) at the bottom and LSTM layers on the top, and feature extraction process is completed before it is fed into the LSTM stack. In this paper, we introduced dense convolutional layer between all adjacent LSTM layers in the network and forms the Layerwise Interweaving Convolutional LSTM(LIC- LSTM). 2 Network Structure and Mechanism The network structure is shown in Fig. 1. In the net, Convolution layer and LSTM layer interweaves with each other. The features extracted in the filters 1

are combined and fed into the next LSTM layer, and the encoded output from the LSTM layer is convoluted to extract higher level features containing sequential information. The evolving of feature from low level to higher level is more homogeneous. While max pooling or mean pooling is often applied in CNN to reduce feature dimensionality, no pooling is introduced in our network based on two considerations: (1) pooling disrupts the sequential order [10], (2) LSTM is capable to deal with long term sequences and shrinking of time steps is not necessary. Long Short Term Memory Extracted Features Dense Convolution Long Short Term Memory Extracted Features Dense Convolution Sequential Input Data Fig. 1: Structure of Layerwise Interweaving Convolutional LSTM(LIC-LSTM) Valid convolution is used for each convolutional layer in the forward pass. Meaning of labels is summarized in Table 1. The extracted feature after the convolutional layer is C ( D ) fi n = W [f] n jk I (i+c j)(d+1 k) j=1 k=1 (1) where n is the n th filter and i denotes the i th time step. The extracted features from N different filters are combined together to form the new input to LSTM layer, in which the following transformations are performed i gate = sigmoid(w ih H s 1 + W ix I s + b i ) (2) o gate = sigmoid(w oh H s 1 + W ox I s + b o ) (3) f gate = sigmoid(w fh H s 1 + W fx I s + b f ) (4) 2

Convolution Window Extracted Features [D, C] (L C + 1) Sequential Data [D, C] (L C + 1) Combined Input to LSTM [D, L] [N, (L C + 1)] [D, C] (L C + 1) LSTM Input Gate LSTM Forget Gate Hidden States [CS, L C + 1] LSTM Output Gate Cell Data [CS, L C + 1] LSTM Input Matrix Fig. 2: Workflow in one layer of LIC-LSTM, dimension of feature and data is illustrated in the brackets. Sequential Input Y o u l o o k g r e a t Target o u l o o k g r e a t Fig. 3: Illustration of Character Prediction C s = f gate C s 1 + i gate tanh(w in [I s ; H s 1 ]) (5) H s = o gate tanh(c s ) (6) where i gate, f gate, o gate are referred to as input, forget and output gates. Weights of the LSTM are W jk with j {i, f, o, in} and k {h, x}. C s denotes the memory cell and H s denotes the hidden state. The evolving of feature for one layer of LIC-LSTM is shown in Fig. 2. 3 Network Training and Performance We reveal the detailed training process of the network with character prediction task. For the majority of related works, the training of CNN stack and LSTM stack are processed separately. This approach, while easy to implement and fine tune, can not guarantee overall optimization for the whole combined network. 3

Table 1: Summary of Label Meaning Label Meaning D Character Representation Dimension L Number of Time Steps I Input Data W[f] Filter Weight C Convolution Layer Window Size N Number of Filters CS Dimension of Memory Cell and Hidden State in LSTM (a) (b) Fig. 4: Performance Comparison of Different Models. (a) Cross Entropy Cost, (b) Error Rate. In LIC-LSTM, the convolutional layers and LSTM layers interweave with each other, and in the forward backward training phase, the gradients are propagated through the whole network simultaneously and concurrently. Such approach is designed to achieve overall optimization in the whole network and better performance. The drawback is requisite for more elegant implementation and careful fine tuning. We used 24,000 English phrases for training and 8,000 English phrases for testing. The dataset is from LightNet toolbox introduced in [9]. The labels are created by moving the training phrase one character left. An illustration of the task is shown in Fig. 3. A total of 67 different characters are included and each character is encoded into one hot vector. The number of time steps is set to 34 and the original phrases are either truncated or concatenated with zeros. SGD is used for the convolutional layer optimization and Adam is used for the LSTM layer. As some of the characters like spaces are much more frequent than other characters, we adopt soft error rate in our experiment. The prediction is classified as correct if the label appears in the top K predictions (K = 5 in our implementation). The momentum is set to be 0.5 initially and then increased 4

(a) (b) Fig. 5: Visualization of Filter Weight. (a) First Convolution Layer, (b) Second Convolution Layer. (a) (b) Fig. 6: Performance comparison with different filter length. (a) 2 layer LIC-LSTM cross entropy cost, (b) 2 layer LIC-LSTM error rate. to 0.9 after 20 mini batches. The initial low momentum allows the model to rapidly reach functioning region and the following high momentum avoids the model being trapped in local optima. We found high momentum(0.9) is necessary for LIC-LSTM with 2 layers to achieve satisfying result. The model training performance is shown in Fig. 4. For the same number of layers, the LIC-LSTM model is achieving higher accuracy than pure LSTM model. And the performance of deep LSTM model lies between shallow LIC-LSTM model and deep LIC- LSTM model. The deep LIC-LSTM model achieves 16.2% error rate on testing dataset compared to 17.8% achieved by deep LSTM model. The LIC-LSTM model is enduring less cross entropy cost with larger filter size as shown in Fig. 6, while error rate remains stable, which shows the model performance is robust on variating filter size. 5

4 Conclusion In this paper, we formed LIC-LSTM with convolutional layer and LSTM layer interweaving with each other. LIC-LSTM combines the feature extracting ability of CNN and sequential data modeling of LSTM homogeneously. Experiment results show the model achieves higher accuracy and shoulders lower cross entropy cost on sequential data modeling tasks compared with state of art LSTM models. The LIC-LSTM can serve as encoders in encoder-decoder model with natural language processing and neural machine translation tasks. References 1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014), http://arxiv.org/abs/1409.0473 2. Donahue, J., Hendricks, L.A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence PP(99), 1 1 (2016) 3. Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language models. CoRR abs/1508.06615 (2015), http://arxiv.org/abs/1508.06615 4. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 1097 1105. Curran Associates, Inc. (2012), http://papers.nips.cc/paper/4824-imagenetclassification-with-deep-convolutional-neural-networks.pdf 5. Luong, M., Pham, H., Manning, C.D.: Effective approaches to attentionbased neural machine translation. CoRR abs/1508.04025 (2015), http://arxiv.org/abs/1508.04025 6. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 3104 3112. Curran Associates, Inc. (2014), http://papers.nips.cc/paper/5346-sequenceto-sequence-learning-with-neural-networks.pdf 7. Vosoughi, S., Vijayaraghavan, P., Roy, D.: Tweet2vec: Learning tweet embeddings using character-level CNN-LSTM encoder-decoder. CoRR abs/1607.07514 (2016), http://arxiv.org/abs/1607.07514 8. Wang, J., Yu, L.C., Lai, K.R., Zhang, X.: Dimensional sentiment analysis using a regional cnn-lstm model. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 225 230 (2016) 9. Ye, C., Zhao, C., Yang, Y., Fermüller, C., Aloimonos, Y.: Lightnet: A versatile, standalone matlab-based environment for deep learning. CoRR abs/1605.02766 (2016), http://arxiv.org/abs/1605.02766 10. Zhou, C., Sun, C., Liu, Z., Lau, F.C.M.: A C-LSTM neural network for text classification. CoRR abs/1511.08630 (2015), http://arxiv.org/abs/1511.08630 6