arxiv: v1 [cs.ne] 15 Nov 2018

Size: px
Start display at page:

Download "arxiv: v1 [cs.ne] 15 Nov 2018"

Transcription

1 Multi-cell LSTM Based Neural Language Model Thomas Cherian, Akshay Badola and Vineet Padmanabhan School of Computer & Information Sciences, University of Hyderabad, India arxiv: v1 [cs.ne] 15 Nov 2018 Abstract Language models, being at the heart of many NLP problems, are always of great interest to researchers. Neural language models come with the advantage of distributed representations and long range contexts. With its particular dynamics that allow the cycling of information within the network, Recurrent neural network (RNN) becomes an ideal paradigm for neural language modeling. Long Short-Term Memory (LSTM) architecture solves the inadequacies of the standard RNN in modeling long-range contexts. In spite of a plethora of RNN variants, possibility to add multiple memory cells in LSTM nodes was seldom explored. Here we propose a multi-cell node architecture for LSTMs and study its applicability for neural language modeling. The proposed multi-cell LSTM language models outperform the state-of-the-art results on well-known Penn Treebank (PTB) setup. 1 Introduction A language model is a function, or an algorithm to learn such a function that grasps the salient characteristics of the distribution of sequences of words in a natural language, allowing one to make probabilistic predictions for the next word given preceding ones. Language models are at the heart of many NLP tasks like Speech Recognition, Machine Translation, Handwriting Recognition, Parsing, Information Retrieval and Part of Speech tagging. N- gram based approaches are the most popular techniques for language modeling [Manning and Schütze, 1999; Jelinek and Mercer, 1980]. They are based on the Markov s assumptions that the probability of occurrence of a particular word in the sequence depends only on the occurrence of previous n 1 words. Though successful, they are unable to make use of the longer contexts and the word similarities. N-gram based approaches can not look into the context beyond a smaller n, say 3 in case of the trigram approach. Also they treat each word as a stand-alone entity and hence are unable to identify and utilize the similarity of words. The language models based on neural networks are termed as neural language models. These models exploit neural network s ability to learn distributed representations to fight the curse of dimensionality [Bengio and Bengio, 2000; Bengio et al., 2003]. Distributed representations also equip the neural models with the ability to utilize the word similarities. In contrast to the n-gram models; because of the use of recurrent neural networks; neural models do not require to predefine the context length. This allows the models to leverage larger contexts to make the predictions. These advantages make neural language models a popular choice despite the high computational complexity of the model. Recurrent neural networks form a very powerful and expressive family for sequential tasks. This is mainly because of the high dimensional hidden state with non-linear dynamics that enable them to remember and process past information. [Sutskever et al., 2011]. Cheap and easily computable gradients by means of back-propagation through time (BPTT) made them even more attractive. Even then RNNs failed to make their way into the mainstream research and applications for a long time due to the difficulty in effectively training them. The Vanishing and Exploding gradient problems; discussed by Bengio et al. [Bengio et al., 1994] led to the complete negligence of the model for decades until recently. In 1997, Hochreiter et al. [Hochreiter and Schmidhuber, 1997] came up with the LSTM architecture to deal with the inadequacies of the standard RNN in modeling long-range contexts. Gers et al. [Gers F. et al., 1999] further enhanced the standard LSTM model with the addition of the forget gate. LSTMs prevent the vanishing gradient problem with the use of Constant Error Carousels (CECs) [Hochreiter and Schmidhuber, 1997]; also known as memory cells. Currently LSTMs are at the core of RNN research. Many variants of basic LSTM have been proposed. However, to the best of our knowledge, these studies were centered around the efficient use of the gates in LSTM cell and none of them studied the effects of incorporating multiple memory cells into the single LSTM node. Here we move in this direction, propose a multi-cell LSTM architecture and studies its application on the neural language modeling. We propose a multi-cell node architecture for the LSTMs with the hypothesis that the availability of more information and the efficient selection of the right information will in-turn result in a better model. The use of multiple memory cells is concealed within the nodes so as to keep the network

2 dynamics same as that of standard LSTM networks. This introduces the need to have an efficient selection mechanism that selects a single value from the multi-cells such that a single output is transmitted from the node. Concretely, our contributions are two-fold: We propose a multi-cell node architecture for LSTM network and investigate the optimum strategies for selecting a particular cell value or combining all the memory cell values so as to output a single value from the node. Further, we apply the multi-cell LSTM for neural language modeling and compare its performance with the state-of-the-art Zaremba s models [Zaremba et al., 2014]. 2 Neural Language Modeling Neural Language Modeling came into limelight with the Neural Probabilistic Language Model (hereafter referred as NPLM) proposed by Bengio et al. [Bengio et al., 2003]. It deals with the challenges of n-gram language models and the Curse of dimensionality [Bengio and Bengio, 2000] by simultaneously learning the distributed representation of words and the probability distribution for the sequences of words expressed in terms of these representations. The feed-forward neural network based NPLM architecture makes use of a single hidden layer with hyperbolic tangent activation to calculate the probability distribution. The softmax output layer gives the probabilities. Generalization is obtained with the use of distributed word representations. Arisoy et al. [Arisoy et al., 2012] extended Bengio s model by adding more hidden layers (upto 4). They observe that the deeper architectures have the potential to improve over the models with single hidden layer. NPLM was exceptionally successful and the further investigations [Goodman, 2001] show that this single model outperforms the mixture of several other models which are based on other techniques. Even though Bengio et al. were successful in incorporating the idea of word similarities into the model [Bengio et al., 2003]; owing to the use of fixed length context that needs to be specified prior to the training; NPLM was unable to make use of larger contexts. In contrast to the feed forward neural networks, the recurrent neural networks do not limit the context size. With recurrent connections, theoretically, information can cycle in the network for arbitrarily long time. In [Mikolov et al., 2010] Mikolov et al. study the Recurrent Neural Network based Language Model (hereafter referred as RNNLM). RNNLM makes use of simple recurrent neural network or the Elman network architecture [Elman, 1990]. The network consists of an input layer, a hidden layer (also known as the context layer) with sigmoid activation and the output softmax layer. Input to the network is the concatenation of one-hot-vector encoding of the current word and the output from neurons in context layer at previous time-step. Learning rate annealing and early stopping techniques are used for efficient training. Even though the preliminary results on toy tasks were promising, Mikolov et al. [Mikolov et al., 2010] conclude that it is impossible for simple RNNs trained with gradient descent to truly capture long context information, mainly because of the Vanishing gradient and Exploding gradient problems. One way to deal with this inefficacy of gradient descent approach to learn long-range context information in the simple RNN is to use an enhanced learning technique. Martens et al. [Martens and Sutskever, 2011] have shown that the use of HF optimization technique can solve the vanishing gradient problem in RNNs. In [Sutskever et al., 2011] Sutskever et al. propose a character-level language model, making use of the Hessian-free optimization to overcome the difficulties associated with RNN training. Another popular technique to solve the training difficulties with the simple RNN is to modify the network to include the memory units that are specially structured to store information over longer periods. Such networks are termed as Long Short Term Memory (LSTM) networks. In [Zaremba et al., 2014] Zaremba et al. makes use of a LSTM network for language modeling (hereafter referred as LSTMLM). LSTMLM uses two LSTM layers as the hidden network. As in the case of [Bengio et al., 2003], the network simultaneously learns the distributed word embeddings and the probability function of the sequences of words represented in terms of these embeddings. Softmax output layer gives the probability distribution. The network is trained by means of back-propagation and stochastic gradient descent. Learning rate annealing is used for the efficient training. Gradient Clipping [Pascanu et al., 2013] is employed to prevent the exploding gradient problem. Zaremba et al. were also successful in identifying and employing an efficient way to apply dropout regularization to the RNNs. Because of this, the dropout-regularized LSTMLM was able to perform much better than the non-regularized RNN based language models. Our model is similar to the neural language model by Zaremba et al. [Zaremba et al., 2014]. Instead of the standard single cell LSTMs used by Zaremba et al., we use our multi-cell LSTMs. 3 Node structure in LSTM and the proposed multi-cell LSTM node Figure 1 shows the structure of a LSTM node. Core of the LSTM node is the memory cell which stores its state. Information stored in the memory cell is updated with the help of two gates. The Input gate regulates the amount of information that gets added into the cell state, whereas the Forget gate controls the amount of information that is removed from the current cell state. The input, forget and output gates use the sigmoid activation. The hyperbolic tangent of the cell state, controlled by the Output gate is given as the output of the node. a t = tanh (W c x t + U c h t 1 + b c ) (1) i t = σ(w i x t + U i h t 1 + b i ) (2)

3 Figure 1: LSTM cell structural diagram f t = σ(w f x t + U f h t 1 + b f ) (3) o t = σ(w o x t + U o h t 1 + b o ) (4) c t = i t.a t + f t.c t 1 (5) h t = o t. tanh (c t ) (6) The equations (1) to (6) describe how a LSTM cell is updated at every time-step t. In these equations, x t is the input to the memory cell at time t. i t, f t and o t are the input, forget and output gate values at time t. a t is the modulated input to the cell at time t. c t denotes the cell state value at time t, where as c t 1 is the cell state value in the previous time step. h t is the output of the cell at time t. Because of this particular node structure, LSTMs are able to store the information for longer time-steps and hence can solve the vanishing gradient problem. This makes LSTMs a popular and powerful choice for langauge modeling applications, where there is a need to store long contextual information. However, the exploding gradients problem still prevails in the network. Gradient clipping [Pascanu et al., 2013] can be used to prevent this. Figure 2 gives the architecture of a multi-cell LSTM node. Instead of a single memory cell in a standard LSTM node, each multi-cell LSTM node holds m memory cells. The gate values and the input to the cells are calculated as in the case of standard LSTM node. These are then broadcast to the memory cells and each cell updates its values using equation (7). c t i = i t.a t + f t.c t 1 i (7) Where c t i denotes the content of memory-cell i at time t. c t 1 i is the content of the memory-cell at time t 1. Updated cell values flow into the selection module, where they are combined or a single value is selected based on the underlying strategy. Regulated by the output gate, the hyperbolic tangent of the selected/combined cell value moves into output. The number of memory cells in each node, m is a hyper-parameter of the network that can be tuned while training. In the next section, we discuss various strategies that can be used for the selection of a single value from the multi-cells. Figure 2: Multi-cell LSTM node 4 Strategies for combining multiple memory cell values in a multi-cell LSTM node As in the case of standard LSTM node, mutli-cell LSTM node takes two inputs (x t, h t 1 ) and provides a single output (h t ). To get the single output from multiple memory cells, we need to have a selection mechanism. Here we discuss the different strategies applied to select a particular memory cell/combine all the cell values, so as to pass a single value to the output. From the selection module, we obtain the effective cell value (c eff ). Output of the node is then computed using Equation 8. h t = o t. tanh (c t eff ) (8) 4.1 Simple Mean The effective cell value, c eff is calculated as the mean value of the multiple memory cells in the node. Hyperbolic tangent of c eff regulated by the output gate forms the output of the node. c t eff = 1 m c t i (9) m i=1 where m is the number of memory cells in the node. c t i denotes the content of memory-cell i at time t. 4.2 Weighted Sum Each cell is given a static weight. First cell is given the maximum weighting and the other cells are weighted in decreasing order with a constant decay from the previous cell weight. c eff is calculated as the sum of these weighted cells. m c t eff = c t i.w i (10) i=1 where w i is the static weight for the i th memory cell. 4.3 Random Selection From the m memory cells available, we randomly pick a memory cell. This is our effective cell value (c eff ). c t eff = Random(C t ) (11) where C t is a one dimensional vector holding the values of all memory cells in the node at time t.

4 4.4 Max Pooling Similar to the max pooling layer used in Convolutional Neural Networks (CNN), selection mechanism of the node picks the maximum value stored in its memory cells. This is the effective cell value of the node. 4.5 Min-Max Pooling c t eff = Max(C t ) (12) We define a threshold value for output gate. If the value is less than the threshold, we take the minimum of the cell values. We go with the maximum of cell values, if the gate value is greater than the threshold. { Min(C c t t ), if o t < o eff = thr Max(C t ), otherwise (13) where o thr is the threshold value for the output gate. It is a hyper parameter that can be tuned by training. 4.6 Learnable Weights for the cells We associate each memory cell with a trainable weight that gets updated in the backward pass. Now we can combine all the weighted cells or select a particular cell by applying any of the above strategies. Here we demonstrate it with the Max Pooling technique. Maximum of the weighted cell values is taken as the effective cell value. c t eff = Max(C t.w t cell) (14) where Wcell t is the learned weights for the memory cells at time t. 5 Network Architecture Architecture of our network is same as the one used by Zaremba et al. [Zaremba et al., 2014] for language modeling. Input is given through an embedding layer having the same size as that of the hidden layers. It is followed by two hidden layers comprising of LSTM nodes. Fully connected softmax layer gives the output probability distribution. 6 Results and Observations Here we discuss in detail about the datasets, experimental setup, results obtained and the comparisons with the results of Zaremba et al. [Zaremba et al., 2014]. 6.1 Dataset Penn Tree Bank, popularly known as the PTB dataset [Marcus et al., 1994] was used for the experiments. Originally a corpus of English sentences with linguistic structure annotations, it is a collection of 2,499 stories sampled from the three year Wall Street Journal (WSJ) collection of 98,732 stories. A variant of the original dataset, distributed with the Chainer package [Tokui et al., 2015] for Python was used for our experiments. The corpus with a vocabulary of words is divided into training, test and validation sets consisting of 929k, 82k and 73k words respectively. 6.2 Evaluation Metric measure was used as the performance evaluation metric. Most popular performance metric for the language modeling systems, is a measure of how well the probability model or the probability distribution predicts a sample. Lower the, better the model is. A low perplexity indicates that the model is good at predicting the sample. The perplexity of a probability distribution is defined as follows: P erplexity = 2 H(p) = 2 x p(x)log2p(x) (15) where, H(p) is the entropy of the distribution and the x ranges over the events. 6.3 Experimental Setup Multi-cell LSTM based language models (hereafter referred as MLSTM-LM) of three different sizes were trained. These are denoted as the small MLSTM-LM, medium MLSTM-LM and the large MLSTM-LM. All of these have two layers of multi-cell LSTMs and are unrolled for 35 steps. We follow the conventions used by Zaremba et al. in [Zaremba et al., 2014]. Hidden states are initialized to 0. We also ensure the statefulness of the model by which the final hidden state of the current mini-batch is used as the initial hidden state of the subsequent mini-batch. Mini-batch size is kept as 20. As in [Zaremba et al., 2014], we apply dropout on all the connections other than the recurrent connections in LSTM layers. The small MLSTM-LM has 200 nodes per hidden layer. Dropout rate of 0.4 is used for these networks. Initial learning rate of 1 is used. The medium MLSTM-LM has 650 nodes in the hidden layers. Dropout rate of 0.5 and initial learning rate of 1.2 were used. For the large MLSTM-LMs having 1500 nodes per hidden layer, initial learning rate of 1.2 and dropout rate of 0.65 were used. All the models use the Algorithm 1 for learning rate annealing. At the end of each epoch, perplexity is computed for the validation set. This is used to anneal the learning rate. Learning rate annealing algorithm is run after each epoch. We define epochs to wait as the number of epochs to wait for an improvement in the perplexity, before annealing the learning rate. minimum reduction is the threshold improvement in perplexity from the previously recorded perplexity value to consider an improvement. minimum learning rate defines the lower bound for learning rate annealing. given chances holds the number of epochs since last improvement in the metric. If there is no considerable improvement for perplexity from the previous epoch, given chances is incremented. If given chances crosses the epochs to wait, we reduce the learning rate by multiplying it with the decay factor. We define Learning rate decay as 0.5, epochs to wait as 2, minimum reduction as 2 and minimum learning rate as Results and Discussion Experiments were conducted with all three models, choosing different selection strategies listed in Section 4. Hyper

5 Algorithm 1: Procedure: Learning rate annealing Input: Learning rate decay, minimum learning rate, minimum reduction, epochs to wait Output: updated learning rate begin if current perplexity is greater than (previous perplexity- minimum reduction) then if given-chances less than the epochs to wait then Increment given-chances. else new learning rate = max(minimum learning rate, learning rate * learning rate decay) given-chances =0 end else given-chances = 0 end Return updated learning rate end parameters; including the number of memory cells per node; were fine-tuned with the help of repeated runs. Table 1 lists the performance of Large MLSTM-LM under different selection strategies. All these models have 10 memory cells per LSTM node. As we can see, the selection strategies namely Simple Mean, Random Selection, Max Pooling and the Min- Max Pooling gives almost same results, whereas the other two strategies; Weighted Sum and Learnable Weights for the cells; do not give comparable results. We obtain the top performing model when the Max Pooling strategy is used. Selection Strategy Simple Mean Random Selection Max Pooling Min-Max Pooling Table 1: Comparison of the multi-cell selection strategies Table 2 presents the experiments to choose the optimum number of memory cells. All these experiments were carried out with the Large MLSTM-LM network, using Max Pooling strategy. Models perform more or less in a similar fashion despite the change in number of memory cells. We recorded the lowest perplexity with the model having 10 memory cells per LSTM node. Table 3 presents the performance of top models. As evident from the table, MLSTM-LM models outperform the models from [Zaremba et al., 2014] on both validation and test sets. While replicating the test of Zaremba et al. using our learning rate annealing algorithm; instead of their original strategy; we obtain better results as compared to the results reported by them in [Zaremba et al., 2014]. Figure 3 gives the Epoch v/s set perplexity plot for the top MLSTM-LM and replicated result of Large regularized LSTM model by Zaremba et al. [Zaremba et al., 2014]. Figure 4 is the Epoch v/s set perplexity plot for small, medium and large MLSTM-LM models. While Zaremba et al. [Zaremba et al., 2014] reports the saturation of large model at Epoch 55 and the medium model at Epoch 39, all of our models attain saturation well before that. All of our models were able to outperform the models from [Zaremba et al., 2014] with a training time of approximately 30 epochs. Interestingly the replication of Zaremba s experiment on the large regularized LSTM model using our learning rate annealing algorithm also saturates early. This also shows the effectiveness of our annealing algorithm. Number of Memory-Cells Table 2: Model performance on varying the number of memory cells in the nodes Model Number of nodes in LSTM layers Results by Zaremba et al. [Zaremba et al., 2014] Non-regularized LSTM Medium-regularized LSTM Large regularized LSTM Top models from experiments Small MLSTM-LM Medium MLSTM-LM Large MLSTM-LM Large regularized LSTM(Replicated) Regularized LSTM(Replicated) Table 3: Comparison of the top models with the results of Zaremba et.al.

6 Epoch v/s Plot Epoch v/s Plot 120 MLSTM-LM LSTM-LM Hidden Layer = 1 Hidden Layer = 2 Hidden Layer = Epochs Figure 3: Epoch v/s plot for LSTM-LM and MLSTM-LM Epochs Figure 5: Epoch v/s plot for MLSTM-LMs with different number of hidden layers Epoch v/s Plot Large MLSTM-LM Medium MLSTM-LM Small MLSTM-LM Experiments were also conducted to study the effects of weight-decay and max-norm regularization techniques. Srivastava et al. [Srivastava et al., 2014] suggest that the maxnorm regularization which thresholds the L2 norm of the weight vectors between 3 and 4 enhances the performance of dropout-regularized networks. But we were unable to obtain any considerable improvement in the model performance with the max-norm regularization. As reported by Mikolov et al. [Mikolov et al., 2010], regularization of network to penalize large weights by weight-decay did not provide any significant improvements Epochs Figure 4: Epoch v/s plot for Small, Medium and Large MLSTM-LMs 6.5 Miscellaneous Experiments Experiments were also conducted to check the performance of the models on varying number of hidden layers and embedding size. Table 4 summarizes the experiments with different number of hidden layers. Figure 5 gives the Epoch v/s perplexity plot for the models. Even though we have a fairly good result with a single hidden layer model, we obtain the optimum result with the network having two hidden layers. Performance deteriorates when more hidden layers are added. It might be an indication of the requirement for better regularization techniques as we add more hidden layers. More research is required on this and we leave it for future work. Table 5 describes the experiments conducted to study the effect of embedding size on the model performance. Small and Large MLSTM-LMs were used for the study. As evident from the table, both the models give best results when the embedding size equals the number of nodes in the hidden layers. Performance of the models deteriorates as we use embedding size different from the hidden layer size. 7 Conclusion In this paper, we have proposed the multi-cell LSTM architecture and discussed the various selection strategies to select a particular memory cell or combine all the memory cells of a node. We have applied the model successfully for language modeling. Effectiveness of different selection strategies and Hidden Layers Table 4: Model performance on varying the number of hidden layers Model Units Embedding size Small Small Small Large Large Large Table 5: of the models on varying the embedding size

7 the effect of varying the number of memory cells on model performance were also studied. Our MLSTM-LM models were able to outperform the state-of-the-art Zaremba s models and attain saturation early as compared to the results reported by Zaremba et al. [Zaremba et al., 2014]. Multi-cell LSTMs need to be investigated more for its performance on the other applications of standard LSTMs. We have explored only the maximum and average functions for selecting a particular value out of the multi-cells. Further ideas from signal theory can also be used with the use of linear filters, as average is also a linear filter. MLSTM-LMs should be applied on the various language modeling applications like speech recognition in order to measure its practicality. References [Arisoy et al., 2012] Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. Deep neural network language models. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, WLM 12, pages 20 28, Stroudsburg, PA, USA, Association for Computational Linguistics. [Bengio and Bengio, 2000] S. Bengio and Y. Bengio. Taking on the curse of dimensionality in joint distributions using neural networks. Trans. Neur. Netw., 11(3): , May [Bengio et al., 1994] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. Trans. Neur. Netw., 5(2): , March [Bengio et al., 2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3: , March [Elman, 1990] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2): , [Gers F. et al., 1999] A. Gers F., J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with lstm. Technical report, [Goodman, 2001] Joshua Goodman. A bit of progress in language modeling. CoRR, cs.cl/ , [Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8): , November [Jelinek and Mercer, 1980] Fred Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Edzard S. Gelsema and Laveen N. Kanal, editors, Proceedings, Workshop on Pattern Recognition in Practice, pages North Holland, Amsterdam, [Manning and Schütze, 1999] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, [Marcus et al., 1994] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The penn treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT 94, pages , Stroudsburg, PA, USA, Association for Computational Linguistics. [Martens and Sutskever, 2011] James Martens and Ilya Sutskever. Learning recurrent neural networks with hessian-free optimization. In ICML, [Mikolov et al., 2010] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTER- SPEECH, pages ISCA, [Pascanu et al., 2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML 13, pages III 1310 III JMLR.org, [Srivastava et al., 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1): , January [Sutskever et al., 2011] Ilya Sutskever, James Martens, and Geoffrey Hinton. Generating text with recurrent neural networks. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML 11, pages , New York, NY, USA, June ACM. [Tokui et al., 2015] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), [Zaremba et al., 2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. CoRR, abs/ , 2014.

Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks

Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks Rahul Dey and Fathi M. Salem Circuits, Systems, and Neural Networks (CSANN) LAB Department of Electrical and Computer Engineering Michigan State

More information

End-To-End Spam Classification With Neural Networks

End-To-End Spam Classification With Neural Networks End-To-End Spam Classification With Neural Networks Christopher Lennan, Bastian Naber, Jan Reher, Leon Weber 1 Introduction A few years ago, the majority of the internet s network traffic was due to spam

More information

On the Efficiency of Recurrent Neural Network Optimization Algorithms

On the Efficiency of Recurrent Neural Network Optimization Algorithms On the Efficiency of Recurrent Neural Network Optimization Algorithms Ben Krause, Liang Lu, Iain Murray, Steve Renals University of Edinburgh Department of Informatics s17005@sms.ed.ac.uk, llu@staffmail.ed.ac.uk,

More information

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

More information

Empirical Evaluation of RNN Architectures on Sentence Classification Task

Empirical Evaluation of RNN Architectures on Sentence Classification Task Empirical Evaluation of RNN Architectures on Sentence Classification Task Lei Shen, Junlin Zhang Chanjet Information Technology lorashen@126.com, zhangjlh@chanjet.com Abstract. Recurrent Neural Networks

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Deep Learning with Tensorflow AlexNet

Deep Learning with Tensorflow   AlexNet Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification

More information

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling Authors: Junyoung Chung, Caglar Gulcehre, KyungHyun Cho and Yoshua Bengio Presenter: Yu-Wei Lin Background: Recurrent Neural

More information

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia 1 LSTM for Language Translation and Image Captioning Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia 2 Part I LSTM for Language Translation Motivation Background (RNNs, LSTMs) Model

More information

Bayesian model ensembling using meta-trained recurrent neural networks

Bayesian model ensembling using meta-trained recurrent neural networks Bayesian model ensembling using meta-trained recurrent neural networks Luca Ambrogioni l.ambrogioni@donders.ru.nl Umut Güçlü u.guclu@donders.ru.nl Yağmur Güçlütürk y.gucluturk@donders.ru.nl Julia Berezutskaya

More information

CSC 578 Neural Networks and Deep Learning

CSC 578 Neural Networks and Deep Learning CSC 578 Neural Networks and Deep Learning Fall 2018/19 7. Recurrent Neural Networks (Some figures adapted from NNDL book) 1 Recurrent Neural Networks 1. Recurrent Neural Networks (RNNs) 2. RNN Training

More information

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Zelun Luo Department of Computer Science Stanford University zelunluo@stanford.edu Te-Lin Wu Department of

More information

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Xu SUN ( 孙栩 ) Peking University xusun@pku.edu.cn Motivation Neural networks -> Good Performance CNN, RNN, LSTM

More information

LSTM with Working Memory

LSTM with Working Memory LSTM with Working Memory Andrew Pulver Department of Computer Science University at Albany Email: apulver@albany.edu Siwei Lyu Department of Computer Science University at Albany Email: slyu@albany.edu

More information

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018 SEMANTIC COMPUTING Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) Dagmar Gromann International Center For Computational Logic TU Dresden, 21 December 2018 Overview Handling Overfitting Recurrent

More information

Study of Residual Networks for Image Recognition

Study of Residual Networks for Image Recognition Study of Residual Networks for Image Recognition Mohammad Sadegh Ebrahimi Stanford University sadegh@stanford.edu Hossein Karkeh Abadi Stanford University hosseink@stanford.edu Abstract Deep neural networks

More information

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Reference Most of the slides are taken from the third chapter of the online book by Michael Nielson: neuralnetworksanddeeplearning.com

More information

Sentiment Classification of Food Reviews

Sentiment Classification of Food Reviews Sentiment Classification of Food Reviews Hua Feng Department of Electrical Engineering Stanford University Stanford, CA 94305 fengh15@stanford.edu Ruixi Lin Department of Electrical Engineering Stanford

More information

House Price Prediction Using LSTM

House Price Prediction Using LSTM House Price Prediction Using LSTM Xiaochen Chen Lai Wei The Hong Kong University of Science and Technology Jiaxin Xu ABSTRACT In this paper, we use the house price data ranging from January 2004 to October

More information

Slide credit from Hung-Yi Lee & Richard Socher

Slide credit from Hung-Yi Lee & Richard Socher Slide credit from Hung-Yi Lee & Richard Socher 1 Review Word Vector 2 Word2Vec Variants Skip-gram: predicting surrounding words given the target word (Mikolov+, 2013) CBOW (continuous bag-of-words): predicting

More information

Layerwise Interweaving Convolutional LSTM

Layerwise Interweaving Convolutional LSTM Layerwise Interweaving Convolutional LSTM Tiehang Duan and Sargur N. Srihari Department of Computer Science and Engineering The State University of New York at Buffalo Buffalo, NY 14260, United States

More information

Deep Learning Applications

Deep Learning Applications October 20, 2017 Overview Supervised Learning Feedforward neural network Convolution neural network Recurrent neural network Recursive neural network (Recursive neural tensor network) Unsupervised Learning

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

Deep Learning Workshop. Nov. 20, 2015 Andrew Fishberg, Rowan Zellers

Deep Learning Workshop. Nov. 20, 2015 Andrew Fishberg, Rowan Zellers Deep Learning Workshop Nov. 20, 2015 Andrew Fishberg, Rowan Zellers Why deep learning? The ImageNet Challenge Goal: image classification with 1000 categories Top 5 error rate of 15%. Krizhevsky, Alex,

More information

LOW-RANK MATRIX FACTORIZATION FOR DEEP NEURAL NETWORK TRAINING WITH HIGH-DIMENSIONAL OUTPUT TARGETS

LOW-RANK MATRIX FACTORIZATION FOR DEEP NEURAL NETWORK TRAINING WITH HIGH-DIMENSIONAL OUTPUT TARGETS LOW-RANK MATRIX FACTORIZATION FOR DEEP NEURAL NETWORK TRAINING WITH HIGH-DIMENSIONAL OUTPUT TARGETS Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, Bhuvana Ramabhadran IBM T. J. Watson

More information

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward

More information

Lecture 7: Neural network acoustic models in speech recognition

Lecture 7: Neural network acoustic models in speech recognition CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 7: Neural network acoustic models in speech recognition Outline Hybrid acoustic modeling overview Basic

More information

Natural Language Processing with Deep Learning CS224N/Ling284

Natural Language Processing with Deep Learning CS224N/Ling284 Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 8: Recurrent Neural Networks Christopher Manning and Richard Socher Organization Extra project office hour today after lecture Overview

More information

XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework

XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework Demo Paper Joerg Evermann 1, Jana-Rebecca Rehse 2,3, and Peter Fettke 2,3 1 Memorial University of Newfoundland 2 German Research

More information

Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling

Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling INTERSPEECH 2014 Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling Haşim Sak, Andrew Senior, Françoise Beaufays Google, USA {hasim,andrewsenior,fsb@google.com}

More information

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Boya Peng Department of Computer Science Stanford University boya@stanford.edu Zelun Luo Department of Computer

More information

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer

More information

UNNORMALIZED EXPONENTIAL AND NEURAL NETWORK LANGUAGE MODELS. Abhinav Sethy, Stanley Chen, Ebru Arisoy, Bhuvana Ramabhadran

UNNORMALIZED EXPONENTIAL AND NEURAL NETWORK LANGUAGE MODELS. Abhinav Sethy, Stanley Chen, Ebru Arisoy, Bhuvana Ramabhadran UNNORMALIZED EXPONENTIAL AND NEURAL NETWORK LANGUAGE MODELS Abhinav Sethy, Stanley Chen, Ebru Arisoy, Bhuvana Ramabhadran IBM T.J. Watson Research Center, Yorktown Heights, NY, USA ABSTRACT Model M, an

More information

27: Hybrid Graphical Models and Neural Networks

27: Hybrid Graphical Models and Neural Networks 10-708: Probabilistic Graphical Models 10-708 Spring 2016 27: Hybrid Graphical Models and Neural Networks Lecturer: Matt Gormley Scribes: Jakob Bauer Otilia Stretcu Rohan Varma 1 Motivation We first look

More information

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple

More information

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University LSTM and its variants for visual recognition Xiaodan Liang xdliang328@gmail.com Sun Yat-sen University Outline Context Modelling with CNN LSTM and its Variants LSTM Architecture Variants Application in

More information

Deep Learning With Noise

Deep Learning With Noise Deep Learning With Noise Yixin Luo Computer Science Department Carnegie Mellon University yixinluo@cs.cmu.edu Fan Yang Department of Mathematical Sciences Carnegie Mellon University fanyang1@andrew.cmu.edu

More information

Hierarchical Multiscale Gated Recurrent Unit

Hierarchical Multiscale Gated Recurrent Unit Hierarchical Multiscale Gated Recurrent Unit Alex Pine alex.pine@nyu.edu Sean D Rosario seandrosario@nyu.edu Abstract The hierarchical multiscale long-short term memory recurrent neural network (HM- LSTM)

More information

Neural Network Optimization and Tuning / Spring 2018 / Recitation 3

Neural Network Optimization and Tuning / Spring 2018 / Recitation 3 Neural Network Optimization and Tuning 11-785 / Spring 2018 / Recitation 3 1 Logistics You will work through a Jupyter notebook that contains sample and starter code with explanations and comments throughout.

More information

Detecting Fraudulent Behavior Using Recurrent Neural Networks

Detecting Fraudulent Behavior Using Recurrent Neural Networks Computer Security Symposium 2016 11-13 October 2016 Detecting Fraudulent Behavior Using Recurrent Neural Networks Yoshihiro Ando 1,2,a),b) Hidehito Gomi 2,c) Hidehiko Tanaka 1,d) Abstract: Due to an increase

More information

Recurrent Neural Networks. Nand Kishore, Audrey Huang, Rohan Batra

Recurrent Neural Networks. Nand Kishore, Audrey Huang, Rohan Batra Recurrent Neural Networks Nand Kishore, Audrey Huang, Rohan Batra Roadmap Issues Motivation 1 Application 1: Sequence Level Training 2 Basic Structure 3 4 Variations 5 Application 3: Image Classification

More information

Weighted Convolutional Neural Network. Ensemble.

Weighted Convolutional Neural Network. Ensemble. Weighted Convolutional Neural Network Ensemble Xavier Frazão and Luís A. Alexandre Dept. of Informatics, Univ. Beira Interior and Instituto de Telecomunicações Covilhã, Portugal xavierfrazao@gmail.com

More information

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition Théodore Bluche, Hermann Ney, Christopher Kermorvant SLSP 14, Grenoble October

More information

Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction

Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction by Noh, Hyeonwoo, Paul Hongsuck Seo, and Bohyung Han.[1] Presented : Badri Patro 1 1 Computer Vision Reading

More information

Discriminative Training with Perceptron Algorithm for POS Tagging Task

Discriminative Training with Perceptron Algorithm for POS Tagging Task Discriminative Training with Perceptron Algorithm for POS Tagging Task Mahsa Yarmohammadi Center for Spoken Language Understanding Oregon Health & Science University Portland, Oregon yarmoham@ohsu.edu

More information

Fast Parallel Training of Neural Language Models

Fast Parallel Training of Neural Language Models Fast Parallel Training of Neural Language Models Tong Xiao, Jingbo Zhu, Tongran Liu, Chunliang Zhang NiuTrans Lab., Northeastern University, Shenyang 110819, China Institute of Psychology (CAS), Beijing

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Combining Neural Networks and Log-linear Models to Improve Relation Extraction

Combining Neural Networks and Log-linear Models to Improve Relation Extraction Combining Neural Networks and Log-linear Models to Improve Relation Extraction Thien Huu Nguyen and Ralph Grishman Computer Science Department, New York University {thien,grishman}@cs.nyu.edu Outline Relation

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted for this course

More information

Sequence Modeling: Recurrent and Recursive Nets. By Pyry Takala 14 Oct 2015

Sequence Modeling: Recurrent and Recursive Nets. By Pyry Takala 14 Oct 2015 Sequence Modeling: Recurrent and Recursive Nets By Pyry Takala 14 Oct 2015 Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using

More information

Domain-Aware Sentiment Classification with GRUs and CNNs

Domain-Aware Sentiment Classification with GRUs and CNNs Domain-Aware Sentiment Classification with GRUs and CNNs Guangyuan Piao 1(B) and John G. Breslin 2 1 Insight Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, Galway,

More information

Lecture 20: Neural Networks for NLP. Zubin Pahuja

Lecture 20: Neural Networks for NLP. Zubin Pahuja Lecture 20: Neural Networks for NLP Zubin Pahuja zpahuja2@illinois.edu courses.engr.illinois.edu/cs447 CS447: Natural Language Processing 1 Today s Lecture Feed-forward neural networks as classifiers simple

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

Deep Learning. Volker Tresp Summer 2014

Deep Learning. Volker Tresp Summer 2014 Deep Learning Volker Tresp Summer 2014 1 Neural Network Winter and Revival While Machine Learning was flourishing, there was a Neural Network winter (late 1990 s until late 2000 s) Around 2010 there

More information

Transition-Based Dependency Parsing with Stack Long Short-Term Memory

Transition-Based Dependency Parsing with Stack Long Short-Term Memory Transition-Based Dependency Parsing with Stack Long Short-Term Memory Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, Noah A. Smith Association for Computational Linguistics (ACL), 2015 Presented

More information

Image-Sentence Multimodal Embedding with Instructive Objectives

Image-Sentence Multimodal Embedding with Instructive Objectives Image-Sentence Multimodal Embedding with Instructive Objectives Jianhao Wang Shunyu Yao IIIS, Tsinghua University {jh-wang15, yao-sy15}@mails.tsinghua.edu.cn Abstract To encode images and sentences into

More information

Outline GF-RNN ReNet. Outline

Outline GF-RNN ReNet. Outline Outline Gated Feedback Recurrent Neural Networks. arxiv1502. Introduction: RNN & Gated RNN Gated Feedback Recurrent Neural Networks (GF-RNN) Experiments: Character-level Language Modeling & Python Program

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Javier Béjar Deep Learning 2018/2019 Fall Master in Artificial Intelligence (FIB-UPC) Introduction Sequential data Many problems are described by sequences Time series Video/audio

More information

Encoding RNNs, 48 End of sentence (EOS) token, 207 Exploding gradient, 131 Exponential function, 42 Exponential Linear Unit (ELU), 44

Encoding RNNs, 48 End of sentence (EOS) token, 207 Exploding gradient, 131 Exponential function, 42 Exponential Linear Unit (ELU), 44 A Activation potential, 40 Annotated corpus add padding, 162 check versions, 158 create checkpoints, 164, 166 create input, 160 create train and validation datasets, 163 dropout, 163 DRUG-AE.rel file,

More information

Improving Language Modeling using Densely Connected Recurrent Neural Networks

Improving Language Modeling using Densely Connected Recurrent Neural Networks Improving Language Modeling using Densely Connected Recurrent Neural Networks Fréderic Godin, Joni Dambre and Wesley De Neve IDLab - ELIS, Ghent University - imec, Ghent, Belgium firstname.lastname@ugent.be

More information

arxiv: v1 [cs.cl] 1 Feb 2016

arxiv: v1 [cs.cl] 1 Feb 2016 Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers arxiv:1602.00367v1 [cs.cl] 1 Feb 2016 Yijun Xiao Center for Data Sciences, New York University ryjxiao@nyu.edu

More information

arxiv: v1 [cs.cl] 30 Jan 2018

arxiv: v1 [cs.cl] 30 Jan 2018 ACCELERATING RECURRENT NEURAL NETWORK LANGUAGE MODEL BASED ONLINE SPEECH RECOGNITION SYSTEM Kyungmin Lee, Chiyoun Park, Namhoon Kim, and Jaewon Lee DMC R&D Center, Samsung Electronics, Seoul, Korea {k.m.lee,

More information

Automated Crystal Structure Identification from X-ray Diffraction Patterns

Automated Crystal Structure Identification from X-ray Diffraction Patterns Automated Crystal Structure Identification from X-ray Diffraction Patterns Rohit Prasanna (rohitpr) and Luca Bertoluzzi (bertoluz) CS229: Final Report 1 Introduction X-ray diffraction is a commonly used

More information

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center Machine Learning With Python Bin Chen Nov. 7, 2017 Research Computing Center Outline Introduction to Machine Learning (ML) Introduction to Neural Network (NN) Introduction to Deep Learning NN Introduction

More information

Neural Network Models for Text Classification. Hongwei Wang 18/11/2016

Neural Network Models for Text Classification. Hongwei Wang 18/11/2016 Neural Network Models for Text Classification Hongwei Wang 18/11/2016 Deep Learning in NLP Feedforward Neural Network The most basic form of NN Convolutional Neural Network (CNN) Quite successful in computer

More information

Iterative CKY parsing for Probabilistic Context-Free Grammars

Iterative CKY parsing for Probabilistic Context-Free Grammars Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST

More information

NEURAL network learning is typically slow, where back

NEURAL network learning is typically slow, where back 1 Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method Xu Sun, Xuancheng Ren, Shuming Ma, Bingzhen Wei, Wei Li, and Houfeng Wang arxiv:1711.06528v1

More information

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 18: Deep learning and Vision: Convolutional neural networks Teacher: Gianni A. Di Caro DEEP, SHALLOW, CONNECTED, SPARSE? Fully connected multi-layer feed-forward perceptrons: More powerful

More information

Lecture 13. Deep Belief Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom

Lecture 13. Deep Belief Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom Lecture 13 Deep Belief Networks Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom Watson Group IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com

More information

SYNTHETIC GRADIENT METHODS

SYNTHETIC GRADIENT METHODS SYNTHETIC GRADIENT METHODS WITH VIRTUAL FORWARD-BACKWARD NETWORKS Takeru Miyato 1,2, Daisuke Okanohara 1, Shin-ichi Maeda 3, Koyama Masanori 4 {miyato, hillbig}@preferred.jp, ichi@sys.i.kyoto-u.ac.jp,

More information

Keras: Handwritten Digit Recognition using MNIST Dataset

Keras: Handwritten Digit Recognition using MNIST Dataset Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA January 31, 2018 1 / 30 OUTLINE 1 Keras: Introduction 2 Installing Keras 3 Keras: Building, Testing, Improving A Simple Network 2 / 30

More information

LSTM: An Image Classification Model Based on Fashion-MNIST Dataset

LSTM: An Image Classification Model Based on Fashion-MNIST Dataset LSTM: An Image Classification Model Based on Fashion-MNIST Dataset Kexin Zhang, Research School of Computer Science, Australian National University Kexin Zhang, U6342657@anu.edu.au Abstract. The application

More information

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders Neural Networks for Machine Learning Lecture 15a From Principal Components Analysis to Autoencoders Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Principal Components

More information

Neural Network Joint Language Model: An Investigation and An Extension With Global Source Context

Neural Network Joint Language Model: An Investigation and An Extension With Global Source Context Neural Network Joint Language Model: An Investigation and An Extension With Global Source Context Ruizhongtai (Charles) Qi Department of Electrical Engineering, Stanford University rqi@stanford.edu Abstract

More information

Stochastic Gradient Descent Algorithm in the Computational Network Toolkit

Stochastic Gradient Descent Algorithm in the Computational Network Toolkit Stochastic Gradient Descent Algorithm in the Computational Network Toolkit Brian Guenter, Dong Yu, Adam Eversole, Oleksii Kuchaiev, Michael L. Seltzer Microsoft Corporation One Microsoft Way Redmond, WA

More information

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017 COMP9444 Neural Networks and Deep Learning 7. Image Processing COMP9444 17s2 Image Processing 1 Outline Image Datasets and Tasks Convolution in Detail AlexNet Weight Initialization Batch Normalization

More information

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture

More information

Keras: Handwritten Digit Recognition using MNIST Dataset

Keras: Handwritten Digit Recognition using MNIST Dataset Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA February 9, 2017 1 / 24 OUTLINE 1 Introduction Keras: Deep Learning library for Theano and TensorFlow 2 Installing Keras Installation

More information

Convolutional Neural Networks

Convolutional Neural Networks Lecturer: Barnabas Poczos Introduction to Machine Learning (Lecture Notes) Convolutional Neural Networks Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

More information

Parsing as Language Modeling

Parsing as Language Modeling Parsing as Language Modeling Do Kook Choe Brown University Providence, RI dc65@cs.brown.edu Eugene Charniak Brown University Providence, RI ec@cs.brown.edu Abstract We recast syntactic parsing as a language

More information

Perceptron: This is convolution!

Perceptron: This is convolution! Perceptron: This is convolution! v v v Shared weights v Filter = local perceptron. Also called kernel. By pooling responses at different locations, we gain robustness to the exact spatial location of image

More information

Motivation Dropout Fast Dropout Maxout References. Dropout. Auston Sterling. January 26, 2016

Motivation Dropout Fast Dropout Maxout References. Dropout. Auston Sterling. January 26, 2016 Dropout Auston Sterling January 26, 2016 Outline Motivation Dropout Fast Dropout Maxout Co-adaptation Each unit in a neural network should ideally compute one complete feature. Since units are trained

More information

CS489/698: Intro to ML

CS489/698: Intro to ML CS489/698: Intro to ML Lecture 14: Training of Deep NNs Instructor: Sun Sun 1 Outline Activation functions Regularization Gradient-based optimization 2 Examples of activation functions 3 5/28/18 Sun Sun

More information

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 8: Introduction to Deep Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 7 December 2018 Overview Introduction Deep Learning General Neural Networks

More information

arxiv: v1 [cs.ai] 14 May 2007

arxiv: v1 [cs.ai] 14 May 2007 Multi-Dimensional Recurrent Neural Networks Alex Graves, Santiago Fernández, Jürgen Schmidhuber IDSIA Galleria 2, 6928 Manno, Switzerland {alex,santiago,juergen}@idsia.ch arxiv:0705.2011v1 [cs.ai] 14 May

More information

Deep Learning on Graphs

Deep Learning on Graphs Deep Learning on Graphs with Graph Convolutional Networks Hidden layer Hidden layer Input Output ReLU ReLU, 22 March 2017 joint work with Max Welling (University of Amsterdam) BDL Workshop @ NIPS 2016

More information

Machine Learning. MGS Lecture 3: Deep Learning

Machine Learning. MGS Lecture 3: Deep Learning Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ Machine Learning MGS Lecture 3: Deep Learning Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ WHAT IS DEEP LEARNING? Shallow network: Only one hidden layer

More information

DropConnect Regularization Method with Sparsity Constraint for Neural Networks

DropConnect Regularization Method with Sparsity Constraint for Neural Networks Chinese Journal of Electronics Vol.25, No.1, Jan. 2016 DropConnect Regularization Method with Sparsity Constraint for Neural Networks LIAN Zifeng 1,JINGXiaojun 1, WANG Xiaohan 2, HUANG Hai 1, TAN Youheng

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

Semantic Word Embedding Neural Network Language Models for Automatic Speech Recognition

Semantic Word Embedding Neural Network Language Models for Automatic Speech Recognition Semantic Word Embedding Neural Network Language Models for Automatic Speech Recognition Kartik Audhkhasi, Abhinav Sethy Bhuvana Ramabhadran Watson Multimodal Group IBM T. J. Watson Research Center Motivation

More information

A Deep Learning primer

A Deep Learning primer A Deep Learning primer Riccardo Zanella r.zanella@cineca.it SuperComputing Applications and Innovation Department 1/21 Table of Contents Deep Learning: a review Representation Learning methods DL Applications

More information

A Simple (?) Exercise: Predicting the Next Word

A Simple (?) Exercise: Predicting the Next Word CS11-747 Neural Networks for NLP A Simple (?) Exercise: Predicting the Next Word Graham Neubig Site https://phontron.com/class/nn4nlp2017/ Are These Sentences OK? Jane went to the store. store to Jane

More information

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides Deep Learning in Visual Recognition Thanks Da Zhang for the slides Deep Learning is Everywhere 2 Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object

More information

Learning Binary Code with Deep Learning to Detect Software Weakness

Learning Binary Code with Deep Learning to Detect Software Weakness KSII The 9 th International Conference on Internet (ICONI) 2017 Symposium. Copyright c 2017 KSII 245 Learning Binary Code with Deep Learning to Detect Software Weakness Young Jun Lee *, Sang-Hoon Choi

More information

arxiv: v3 [cs.lg] 29 Oct 2018

arxiv: v3 [cs.lg] 29 Oct 2018 COMPOSITIONAL CODING CAPSULE NETWORK WITH K-MEANS ROUTING FOR TEXT CLASSIFICATION Hao Ren, Hong Lu School of Computer Science, Fudan University Shanghai, China arxiv:1810.09177v3 [cs.lg] 29 Oct 2018 ABSTRACT

More information

Rotation Invariance Neural Network

Rotation Invariance Neural Network Rotation Invariance Neural Network Shiyuan Li Abstract Rotation invariance and translate invariance have great values in image recognition. In this paper, we bring a new architecture in convolutional neural

More information

Deep Neural Networks:

Deep Neural Networks: Deep Neural Networks: Part II Convolutional Neural Network (CNN) Yuan-Kai Wang, 2016 Web site of this course: http://pattern-recognition.weebly.com source: CNN for ImageClassification, by S. Lazebnik,

More information

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina Residual Networks And Attention Models cs273b Recitation 11/11/2016 Anna Shcherbina Introduction to ResNets Introduced in 2015 by Microsoft Research Deep Residual Learning for Image Recognition (He, Zhang,

More information

Deep Learning Cook Book

Deep Learning Cook Book Deep Learning Cook Book Robert Haschke (CITEC) Overview Input Representation Output Layer + Cost Function Hidden Layer Units Initialization Regularization Input representation Choose an input representation

More information