CSC 578 Neural Networks and Deep Learning


CSC 578 Neural Networks and Deep Learning, Fall 2018/19. 7. Recurrent Neural Networks (some figures adapted from the NNDL book)

Recurrent Neural Networks
1. Recurrent Neural Networks (RNNs)
2. RNN Training
   2.1 Loss Minimization
3. Bidirectional RNNs
4. Encoder-Decoder NNs
5. Deep RNNs
6. Recursive NNs
7. Long Short-Term Memory (LSTM)
8. LSTM Code Example

1 Recurrent Neural Networks

Sequence models: beyond one-to-one problems. (Figure: Machine Learning, Tom Mitchell)

Recurrent Neural Networks (RNNs) use the outputs of network units at time t as inputs to other units at time t+1. Because of this topology, RNNs are often used for sequential modeling, such as time-series data. The information carried from time t to t+1 is essentially the context of the preceding input, and serves as the network's internal memory. (Machine Learning, Tom Mitchell)

Sequence modeling is predicting the next value Y_i from the preceding values Y_1..Y_{i-1} (e.g., stock-market price), or predicting an output sequence Y_1..Y_n for a given input sequence X_1..X_n (e.g., part-of-speech tagging in NLP). There are many sequence models in machine learning, such as (Hidden) Markov Models, Maximum Entropy models, and Conditional Random Fields. In neural networks, sequence modeling can be depicted as: (Machine Learning, Tom Mitchell)

Back to RNNs. A basic RNN is essentially equivalent to a feedforward network, since the recurrence can be unfolded (in time).

The information from time t-1 can come from the hidden node(s) or from the output, depending on the architecture; accordingly, the activation equations differ.

Elman network:
a^{(t)} = b + W h^{(t-1)} + U x^{(t)}
h^{(t)} = tanh(a^{(t)})
o^{(t)} = c + V h^{(t)}
y^{(t)} = softmax(o^{(t)})

Jordan network (less powerful):
a^{(t)} = b + W o^{(t-1)} + U x^{(t)}
h^{(t)} = tanh(a^{(t)})
o^{(t)} = c + V h^{(t)}
y^{(t)} = softmax(o^{(t)})

Note: b and c are bias vectors. Also, the neuron activation function could be something other than tanh, such as ReLU.
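As a concrete illustration of the Elman equations above, here is a minimal NumPy sketch of the forward pass over a short sequence; the dimensions, random weights, and data are placeholders, not anything from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 5, 2, 4               # input size, hidden size, output size, sequence length

U = rng.normal(scale=0.1, size=(n_hid, n_in))    # input-to-hidden weights
W = rng.normal(scale=0.1, size=(n_hid, n_hid))   # hidden-to-hidden (recurrent) weights
V = rng.normal(scale=0.1, size=(n_out, n_hid))   # hidden-to-output weights
b = np.zeros(n_hid)                              # hidden bias
c = np.zeros(n_out)                              # output bias

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

x = rng.normal(size=(T, n_in))   # an input sequence x^(1..T)
h = np.zeros(n_hid)              # h^(0): initial hidden state

for t in range(T):
    a = b + W @ h + U @ x[t]     # a^(t) = b + W h^(t-1) + U x^(t)
    h = np.tanh(a)               # h^(t)
    o = c + V @ h                # o^(t)
    y = softmax(o)               # y^(t)
    print(t, y)
```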

2 RNN Training

Each sequence produces an error that is the sum of the deviations of all target signals from the corresponding activations computed by the network. To measure the error at each time t, most of the loss functions used in feed-forward neural networks can be used:
- Negative log-likelihood
- Mean squared error (MSE)
- Cross-entropy

(Source: http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/)
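For instance, with softmax outputs and one correct class per time step, the sequence error is simply the per-step cross-entropies added up. A tiny NumPy sketch with made-up numbers:

```python
import numpy as np

# Sequence loss as the sum of per-time-step cross-entropies between the
# softmax outputs y^(t) and the correct classes (toy numbers, not from the slides).
y_hat = np.array([[0.7, 0.2, 0.1],    # network outputs at t = 1..3
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
targets = np.array([0, 1, 2])         # correct class index at each time step

loss_per_step = -np.log(y_hat[np.arange(len(targets)), targets])
total_loss = loss_per_step.sum()      # error for the whole sequence
print(loss_per_step, total_loss)
```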

There is also another training setup, called teacher forcing: rather than using the values computed at the hidden or output nodes, the recurrent information from the previous time step uses the correct target output (taken from the training data).
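A conceptual Python sketch of teacher forcing; the `step` function, data, and flag name are hypothetical, just to show where the substitution happens:

```python
# Conceptual sketch: during training with teacher forcing, the recurrent input
# at step t is the *target* output from step t-1, not the value the network
# actually produced. `step` is a hypothetical single-step (Jordan-style) RNN call.
def run_sequence(step, x_seq, y_targets, teacher_forcing=True):
    y_prev = None
    outputs = []
    for t, x_t in enumerate(x_seq):
        y_t = step(x_t, y_prev)       # one recurrent step using the previous output
        outputs.append(y_t)
        # teacher forcing: feed back the correct answer instead of the prediction
        y_prev = y_targets[t] if teacher_forcing else y_t
    return outputs
```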

2.1 Loss Minimization

For loss minimization, common approaches are:

1. Gradient descent

The standard method is Backpropagation Through Time (BPTT), which is a generalization of the BP algorithm for feed-forward networks. Basically, the error computed at the end of the (input) sequence, which is the sum of all the errors in the sequence, is propagated backward through the ENTIRE sequence (e.g., for t = 3, the gradient sums contributions from all earlier time steps). Since the sequence could be long, we often clip the backward pass by truncating the backpropagation to a few steps. (Machine Learning, Tom Mitchell)
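In practice, one common way to get this truncation when preparing data for a Keras recurrent layer is to cut a long series into fixed-length windows of k steps, so gradients never flow back more than k steps. A small NumPy sketch with toy data (the series and k are placeholders):

```python
import numpy as np

k = 5                                     # truncation length (steps of backprop)
series = np.arange(100, dtype="float32")  # a long 1-D sequence (toy data)

# each training sample is a window of k consecutive values;
# the target is the value that immediately follows the window
windows = np.array([series[i:i + k] for i in range(len(series) - k)])
targets = series[k:]

X = windows.reshape((-1, k, 1))           # (samples, time steps, features) for an RNN layer
y = targets
print(X.shape, y.shape)                   # (95, 5, 1) (95,)
```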

(Figure slide; credit: Machine Learning, Tom Mitchell)

Then the gradients on the various parameters follow from the chain rule. However, gradient descent suffers from the same vanishing-gradient problem as in feed-forward networks. (DNN book)

(Figure slide; source: http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/)

2. Global optimization methods

Training the weights in a neural network can be modeled as a nonlinear global optimization problem, and arbitrary global optimization techniques may then be used to minimize this target function. The most common global optimization method for training RNNs is genetic algorithms. [Wikipedia] (Machine Learning, Tom Mitchell)

3 Bidirectional RNNs

Bidirectional RNNs (BRNNs) combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence. By using the two time directions, input information from both the past and the future of the current time frame can be used, unlike a standard RNN, which requires delays to include future information. BRNNs can be trained with essentially the same algorithms as RNNs, because the neurons in the two directions do not interact.
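A minimal Keras sketch of a BRNN, assuming TensorFlow's Keras API; the layer sizes, input shape, and task (binary classification) are arbitrary choices for illustration:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

# one LSTM reads the sequence forward, another backward; their outputs are concatenated
model = Sequential([
    Bidirectional(LSTM(32), input_shape=(10, 8)),   # 10 time steps, 8 features (assumed)
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```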

4 Encoder-Decoder NNs

Generally speaking, encoder-decoder networks learn a mapping from an input sequence to an output sequence. With multilayer feedforward networks, such networks are called autoassociators: the input and output could be the same (to learn the identity function, i.e., compression) or different (e.g., classification with a one-hot output representation).

With recurrent networks, an encoder-decoder architecture acts on a sequence as the input/output unit, NOT a single unit/neuron. There is one RNN for encoding and another RNN for decoding. A hidden state taken from the end of the input sequence essentially represents the context (variable C), or a semantic summary, of the input sequence, and it is passed to the decoder RNN.
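A rough Keras sketch of this encoder-decoder idea, in the style of the standard Keras seq2seq example: the encoder LSTM's final states act as the context C and initialize the decoder LSTM. The token counts and latent dimension below are assumptions.

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

n_enc_tokens, n_dec_tokens, latent = 50, 60, 128   # assumed vocabulary sizes and state size

enc_inputs = Input(shape=(None, n_enc_tokens))
_, state_h, state_c = LSTM(latent, return_state=True)(enc_inputs)   # final states = context C

dec_inputs = Input(shape=(None, n_dec_tokens))
dec_outputs = LSTM(latent, return_sequences=True)(dec_inputs,
                                                  initial_state=[state_h, state_c])
dec_outputs = Dense(n_dec_tokens, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```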

5 Deep RNNs

RNNs can be made into deep networks in many ways, for example:
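One common construction (not necessarily the variant depicted on the slide) is to stack recurrent layers, with each lower layer passing its full output sequence upward. A hedged Keras sketch with arbitrary sizes:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(20, 16)),  # layer 1 feeds its full sequence upward
    LSTM(32),                                               # layer 2 returns only its last output
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```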

6 Recursive NNs

A recursive neural network is a kind of deep neural network created by applying the same set of weights recursively over a structured input. In the simplest architecture, nodes are combined into parents using a weight matrix that is shared across the whole network, and a non-linearity such as tanh. [Wikipedia]

If c_1 and c_2 are n-dimensional vector representations of nodes, their parent will also be an n-dimensional vector, calculated as

p_{1,2} = tanh(W [c_1; c_2])

where W is an n x 2n matrix.

Training: Typically, stochastic gradient descent (SGD) is used to train the network. The gradient is computed using backpropagation through structure (BPTS), a variant of the backpropagation through time algorithm used for recurrent neural networks. [Wikipedia]
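A minimal NumPy sketch of this composition rule, with a shared weight matrix W applied at every internal node of a tiny, hand-built tree; n and the vectors are placeholders:

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(n, 2 * n))   # one weight matrix shared across the whole tree

def compose(c1, c2):
    # p = tanh(W [c1; c2]); children and parent are all n-dimensional
    return np.tanh(W @ np.concatenate([c1, c2]))

c1, c2 = rng.normal(size=n), rng.normal(size=n)
p12 = compose(c1, c2)                      # parent of c1 and c2
root = compose(p12, rng.normal(size=n))    # parents can be composed further up the tree
print(p12.shape, root.shape)               # (4,) (4,)
```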

7 Long Short-Term Memory (LSTM)

The idea of RNNs is to incorporate dependencies (information from earlier in the input sequence, serving as context or memory) when processing the current input. Long Short-Term Memory (LSTM) networks are a special kind of RNN capable of learning long-term/long-distance dependencies. An LSTM network consists of LSTM units. A common LSTM unit is composed of a context/state cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. [Wikipedia]

The big picture: an LSTM maintains an internal state and produces an output. The diagram shows an LSTM unit over three time slices: the current slice (t), as well as the previous (t-1) and next (t+1) slices. C is the context value; both the output and the context values are always fed to the next time slice. (Source: course notebook t81_558_class10_lstm.ipynb)

Step-by-step walk-through:

(1) The forget gate controls the information coming from h_{t-1} and the new input x_t (producing values in 0-1 via the sigmoid), for all internal state cells (indexed by i) at time t:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

(2) The input gate applies a sigmoid to control which values to update at time t. Then tanh is applied to create a draft of the new context, C̃_t:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

(http://colah.github.io/posts/2015-08-understanding-lstms/)

(3) The old state C_{t-1} is multiplied by f_t, to forget the things in the previous context that we decided to forget. Then we add i_t * C̃_t, the new candidate values scaled by how much we decided to update each state value:

C_t = f_t * C_{t-1} + i_t * C̃_t

(4) Finally, we decide which information to output (by filtering through the output gate). The hidden value for time t is the output gate value multiplied by the tanh of the new context:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
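Putting steps (1)-(4) together, a minimal NumPy sketch of one LSTM unit rolled over a toy sequence; weight shapes follow the [h_{t-1}, x_t] concatenation used above, and all values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 3, 5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix and bias per gate, acting on the concatenated [h, x] vector
Wf, Wi, Wc, Wo = (rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for _ in range(4))
bf = bi = bc = bo = np.zeros(n_hid)

def lstm_step(x_t, h_prev, C_prev):
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ hx + bf)          # (1) forget gate
    i_t = sigmoid(Wi @ hx + bi)          # (2) input gate
    C_tilde = np.tanh(Wc @ hx + bc)      #     candidate (draft) context
    C_t = f_t * C_prev + i_t * C_tilde   # (3) new context/state
    o_t = sigmoid(Wo @ hx + bo)          # (4) output gate
    h_t = o_t * np.tanh(C_t)             #     new hidden/output value
    return h_t, C_t

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):   # roll the unit over a 4-step toy sequence
    h, C = lstm_step(x_t, h, C)
print(h, C)
```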

Gated Recurrent Unit (GRU) (figure from http://colah.github.io/posts/2015-08-understanding-lstms/)
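For comparison, a hedged NumPy sketch of a single GRU step using the standard formulation (update gate z, reset gate r, merged state/output h); these equations are supplied here as background, since the slide itself only shows the figure, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid = 3, 5
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

Wz, Wr, Wh = (rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for _ in range(3))

def gru_step(x_t, h_prev):
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ hx)                                       # update gate
    r_t = sigmoid(Wr @ hx)                                       # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                    # merged state/output

h = np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):   # roll over a 4-step toy sequence
    h = gru_step(x_t, h)
print(h)
```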

8 LSTM Code Example


The data has 256 input variables (a vector of size 256 for one input instance) and 1 output variable. There are 128 (hidden) LSTM units for each time step/slice. The task is binary classification. A good explanation of the number of hidden units in an LSTM: https://www.quora.com/what-is-the-relationship-between-timestep-and-number-hidden-unit-in-lstm
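The code itself is not reproduced in this transcription, so here is a hedged Keras reconstruction matching that description (256 features per time step, 128 LSTM units, one sigmoid output); the sequence length is an assumption:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

timesteps = 20   # assumption; the slide does not state the sequence length
model = Sequential([
    LSTM(128, input_shape=(timesteps, 256)),   # 128 hidden LSTM units per time step
    Dense(1, activation="sigmoid"),            # binary classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```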

The data has 1 input variable and 1 output variable. 4 (hidden) LSTM units are chosen (for each time step/slice). The task is regression. Another example: https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
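Likewise, a hedged Keras reconstruction of this regression setup (1 input variable, 4 LSTM units, 1 numeric output); the look-back window length is an assumption:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

look_back = 1   # assumption; the slide does not state the window length
model = Sequential([
    LSTM(4, input_shape=(look_back, 1)),   # 4 hidden LSTM units
    Dense(1),                              # linear output for regression
])
model.compile(optimizer="adam", loss="mean_squared_error")
model.summary()
```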