CSC 578 Neural Networks and Deep Learning Fall 2018/19 7. Recurrent Neural Networks (Some figures adapted from NNDL book) 1
Recurrent Neural Networks
1. Recurrent Neural Networks (RNNs)
2. RNN Training
   2.1 Loss Minimization
3. Bidirectional RNNs
4. Encoder-Decoder NNs
5. Deep RNNs
6. Recursive NNs
7. Long Short-Term Memory (LSTM)
8. LSTM Code Example 2
1 Recurrent Neural Networks Sequence models beyond one-to-one problems. Machine Learning, Tom Mitchell 3
1 Recurrent Neural Networks Recurrent Neural Networks (RNNs) use the outputs of network units at time t as inputs to other units at time t+1. Because of this topology, RNNs are often used for sequential modeling, such as time-series data. The information carried from time t to t+1 is essentially the context of the preceding input, and serves as the network's internal memory. Machine Learning, Tom Mitchell 4
Sequence modeling is the task of predicting the next value Y_i from the preceding values Y_1..Y_{i-1} (e.g., stock market prices), or of predicting an output sequence Y_1..Y_n for a given input sequence X_1..X_n (e.g., part-of-speech tagging in NLP). There are many sequence models in Machine Learning, such as (Hidden) Markov Models, Maximum Entropy models, and Conditional Random Fields. In Neural Networks, sequence modeling can be depicted as: Machine Learning, Tom Mitchell 5
Back to RNNs. A basic RNN is essentially equivalent to a feedforward network, since the recurrence can be unfolded (unrolled) in time. 6
The information from time t-1 may come either from the hidden node(s) or from the output node(s), depending on the architecture; accordingly, the recurrence equation differs.

Elman network:
a^{(t)} = b + W h^{(t-1)} + U x^{(t)}
h^{(t)} = \tanh(a^{(t)})
o^{(t)} = c + V h^{(t)}
y^{(t)} = \mathrm{softmax}(o^{(t)})

Jordan network (less powerful):
a^{(t)} = b + W o^{(t-1)} + U x^{(t)}
h^{(t)} = \tanh(a^{(t)})
o^{(t)} = c + V h^{(t)}
y^{(t)} = \mathrm{softmax}(o^{(t)})

Note: b and c are bias vectors. Also, the neuron activation function could be something other than tanh, such as ReLU. 7
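To make the Elman equations concrete, here is a minimal NumPy sketch of one forward step, run over a short sequence. The dimensions and random weights are illustrative assumptions, not values from the slides.

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

n_in, n_hidden, n_out = 4, 3, 2
rng = np.random.default_rng(0)
U = rng.standard_normal((n_hidden, n_in))      # input-to-hidden weights
W = rng.standard_normal((n_hidden, n_hidden))  # hidden-to-hidden (recurrent) weights
V = rng.standard_normal((n_out, n_hidden))     # hidden-to-output weights
b = np.zeros(n_hidden)                         # hidden bias
c = np.zeros(n_out)                            # output bias

def elman_step(x_t, h_prev):
    a_t = b + W @ h_prev + U @ x_t   # pre-activation
    h_t = np.tanh(a_t)               # new hidden state (carried to t+1)
    o_t = c + V @ h_t
    y_t = softmax(o_t)               # output distribution at time t
    return h_t, y_t

# run over a short input sequence
h = np.zeros(n_hidden)
for x in rng.standard_normal((5, n_in)):
    h, y = elman_step(x, h)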
2 RNN Training
Each sequence produces an error, computed as the sum over time of the deviations of the target signals from the corresponding activations computed by the network. To measure the error at each time t, most of the loss functions used in feed-forward neural networks can be used:
o Negative log-likelihood
o Mean squared error (MSE)
o Cross-entropy
8 http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
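For reference, the standard per-sequence forms of these losses (written here in LaTeX, with \hat{y}_t the network output and y_t the target at time t; these are the usual textbook definitions, not taken from the slide) are:

E_{\mathrm{NLL}} = -\sum_{t=1}^{T} \log p(y_t \mid x_1, \ldots, x_t)

E_{\mathrm{MSE}} = \frac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2

E_{\mathrm{CE}} = -\sum_{t=1}^{T} \sum_{k} y_{t,k} \log \hat{y}_{t,k}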
There is also another training setup, called teacher forcing. Rather than feeding back the values computed at the hidden or output nodes, the recurrent connection uses the correct (target) output from the training data at the previous time step. 9
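A minimal NumPy sketch of teacher forcing for a Jordan-style network (output fed back); all names, sizes, and random data are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out, T = 3, 5, 2, 4
U = rng.standard_normal((n_hidden, n_in))
W = rng.standard_normal((n_hidden, n_out))   # recurrence from the previous *output*
V = rng.standard_normal((n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
xs = rng.standard_normal((T, n_in))          # input sequence
ys = rng.standard_normal((T, n_out))         # target output sequence

def step(x_t, feedback):
    h_t = np.tanh(b + W @ feedback + U @ x_t)
    return c + V @ h_t

o = np.zeros(n_out)
for t in range(T):
    # teacher forcing: the recurrent input is the *target* output y_{t-1},
    # not the value the network actually produced at t-1
    feedback = ys[t - 1] if t > 0 else np.zeros(n_out)
    o = step(xs[t], feedback)
    # (at test time the network runs free: feedback = the previous o)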
2.1 Loss Minimization
For loss minimization, common approaches are:
1. Gradient Descent
The standard method is Backpropagation Through Time (BPTT), which is a generalization of the BP algorithm for feed-forward networks. Basically, the error computed at the end of the (input) sequence, which is the sum of all errors in the sequence, is propagated backward through the ENTIRE sequence (e.g., for t=3, as expanded below). Since the sequence could be long, the backward pass is often truncated to a few steps (truncated BPTT). Machine Learning, Tom Mitchell 10
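Written out for t=3, following the chain rule used in the wildml tutorial cited on the previous slide (a standard expansion, reproduced here for reference):

\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial h_3}\,\frac{\partial h_3}{\partial h_k}\,\frac{\partial h_k}{\partial W},
\qquad \text{where} \qquad
\frac{\partial h_3}{\partial h_k} = \prod_{j=k+1}^{3} \frac{\partial h_j}{\partial h_{j-1}},

and the total gradient is \frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}.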
Machine Learning, Tom Mitchell 11
The gradients with respect to the various parameters then follow by the chain rule. However, gradient descent suffers from the same vanishing gradient problem as deep feed-forward networks. DNN book 12
13 http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
2. Global optimization methods
Training the weights in a neural network can be modeled as a nonlinear global optimization problem. Arbitrary global optimization techniques may then be used to minimize this target function. The most common global optimization method for training RNNs is genetic algorithms. [Wikipedia] Machine Learning, Tom Mitchell 14
3 Bidirectional RNNs
Bidirectional RNNs (BRNNs) combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence. By using the two time directions, input information from both the past and the future of the current time frame can be used, unlike a standard RNN, which requires delays to include future information. BRNNs can be trained with essentially the same algorithms as RNNs, because the two sets of directional neurons do not interact with each other. A minimal code sketch is shown below. 15
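A minimal sketch of a bidirectional recurrent layer, assuming TensorFlow's Keras API; the input shape (sequences of length 10 with 8 features) and unit counts are illustrative assumptions, not from the slides.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(10, 8)),
    # one RNN reads the sequence forward, a second reads it backward;
    # by default their outputs are concatenated
    layers.Bidirectional(layers.SimpleRNN(16)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()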
4 Encoder-Decoder NNs
Generally speaking, encoder-decoder networks learn a mapping from an input sequence to an output sequence. With multilayer feedforward networks, such networks are called "auto-associators": the input and output could be the same (to learn the identity function -> compression) or different (e.g., classification with a one-hot-vector output representation). 16
With recurrent networks, an encoder-decoder architecture acts on a sequence as the input/output unit, NOT a single unit/neuron. There is one RNN for encoding and another RNN for decoding. The hidden state at the end of the input sequence essentially represents the context (variable C), or a semantic summary, of the input sequence, and it is passed to the decoder RNN, as sketched below. 17
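A bare-bones NumPy sketch of this idea: one RNN compresses the input sequence into a context vector C, and a second RNN is initialized from C to produce the output sequence. All names, sizes, and random weights are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out, T_in, T_out = 3, 6, 4, 5, 7
We, Ue = rng.standard_normal((n_hid, n_hid)), rng.standard_normal((n_hid, n_in))
Wd, Vd = rng.standard_normal((n_hid, n_hid)), rng.standard_normal((n_out, n_hid))

xs = rng.standard_normal((T_in, n_in))       # input sequence

# --- encoder RNN: read the whole input sequence ---
h = np.zeros(n_hid)
for x_t in xs:
    h = np.tanh(We @ h + Ue @ x_t)
C = h                                        # context vector: summary of the input

# --- decoder RNN: generate the output sequence from the context ---
s = C                                        # decoder state starts from C
outputs = []
for _ in range(T_out):
    s = np.tanh(Wd @ s)                      # (a full decoder would also feed back y_{t-1})
    outputs.append(Vd @ s)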
5 Deep RNNs
RNNs can be made into deep networks in many ways, for example by stacking multiple recurrent layers so that each layer's output sequence becomes the input sequence of the layer above (see the sketch below). 18
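A minimal Keras sketch of a stacked (deep) RNN; the shapes and unit counts are illustrative assumptions. The lower layers must return their full output sequences so the next recurrent layer receives a sequence as input.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20, 16)),                    # 20 time steps, 16 features
    layers.SimpleRNN(32, return_sequences=True),    # layer 1 emits a sequence
    layers.SimpleRNN(32, return_sequences=True),    # layer 2 emits a sequence
    layers.SimpleRNN(32),                           # top layer emits the last state only
    layers.Dense(10, activation="softmax"),
])
model.summary()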
6 Recursive NNs
A recursive neural network is a kind of deep neural network created by applying the same set of weights recursively over a structured input. In the simplest architecture, nodes are combined into parents using a weight matrix that is shared across the whole network, together with a non-linearity such as tanh. [Wikipedia] 19
If c_1 and c_2 are n-dimensional vector representations of nodes, their parent will also be an n-dimensional vector, calculated as
p_{1,2} = \tanh(W [c_1; c_2]),
where W is an n x 2n matrix.
Training: Typically, stochastic gradient descent (SGD) is used to train the network. The gradient is computed using backpropagation through structure (BPTS), a variant of backpropagation through time used for recurrent neural networks. [Wikipedia] 20
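A minimal NumPy sketch of the parent-composition step above; the dimension n = 4 and the random weights are illustrative assumptions.

import numpy as np

n = 4
rng = np.random.default_rng(3)
W = rng.standard_normal((n, 2 * n))          # shared n x 2n composition matrix

def compose(c1, c2):
    # p = tanh(W [c1; c2]): two children combined into one parent vector
    return np.tanh(W @ np.concatenate([c1, c2]))

# combine two leaf vectors, then combine the result with a third node
c1, c2, c3 = rng.standard_normal((3, n))
p12 = compose(c1, c2)
root = compose(p12, c3)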
7 Long Short-Term Memory (LSTM)
The idea behind RNNs is to incorporate dependencies -- information from earlier in the input sequence, serving as context or memory -- when processing the current input. Long Short-Term Memory (LSTM) networks are a special kind of RNN capable of learning long-term/long-distance dependencies. An LSTM network consists of LSTM units. A common LSTM unit is composed of a context/state cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. [Wikipedia] 21
The Big Picture: An LSTM maintains an internal state and produces an output. The following diagram shows an LSTM unit over three time slices: the current time slice (t), as well as the previous (t-1) and next (t+1) slices. C is the context value. Both the output and context values are always fed to the next time slice. http://localhost:8888/notebooks/temp-heaton/t81_558_class10_lstm.ipynb 22
Step-by-step walkthrough:
(1) The forget gate controls which information coming from h_{t-1} and the new input x_t to keep (producing values between 0 and 1 via a sigmoid), for each internal state cell at time t:
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
(2) The input gate applies a sigmoid to control which values to update at time t. Then tanh is applied to create a draft (candidate) of the new context \tilde{C}_t:
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)
http://colah.github.io/posts/2015-08-understanding-lstms/ 23
(3) The old state C_{t-1} is multiplied by f_t, to forget the things in the previous context that we decided to forget earlier. Then we add i_t * \tilde{C}_t, the new candidate values scaled by how much we decided to update each state value:
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
(4) We also decide which information to output (by filtering through the output gate). The hidden value for time t is then the output gate value multiplied by the tanh of the new context:
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t * \tanh(C_t)
24
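A NumPy sketch of one LSTM step implementing equations (1)-(4) above, treating [h_{t-1}, x_t] as a single concatenated vector. The sizes and random weights are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid = 3, 5
rng = np.random.default_rng(4)
Wf, Wi, Wc, Wo = rng.standard_normal((4, n_hid, n_hid + n_in))
bf, bi, bc, bo = np.zeros((4, n_hid))

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)               # (1) forget gate
    i_t = sigmoid(Wi @ z + bi)               # (2) input gate
    C_tilde = np.tanh(Wc @ z + bc)           #     candidate context
    C_t = f_t * C_prev + i_t * C_tilde       # (3) new context / cell state
    o_t = sigmoid(Wo @ z + bo)               # (4) output gate
    h_t = o_t * np.tanh(C_t)                 #     new hidden state / output
    return h_t, C_t

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.standard_normal((6, n_in)):
    h, C = lstm_step(x, h, C)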
Gated Recurrent Unit (GRU) http://colah.github.io/posts/2015-08-understanding-lstms/ 25
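For reference, the GRU update in the notation of the colah post linked above (the standard formulation; the diagram itself is not reproduced here) is:

z_t = \sigma(W_z [h_{t-1}, x_t])
r_t = \sigma(W_r [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W [r_t * h_{t-1}, x_t])
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t

The GRU merges the forget and input gates into a single update gate z_t and merges the cell state and hidden state.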
8 LSTM Code Example 26
The data has 256 input variables (a vector of size 256 for one input instance) and 1 output variable. There are 128 (hidden) LSTM units for each time step/slice. The task is binary classification. A good explanation of the number of hidden units in an LSTM: https://www.quora.com/what-is-the-relationship-between-timestep-andnumber-hidden-unit-in-lstm 28
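A minimal Keras sketch of a model matching this description (assuming TensorFlow's Keras API; the sequence length of 50 is an illustrative assumption, not from the slide):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(50, 256)),            # 50 time steps x 256 input variables
    layers.LSTM(128),                        # 128 hidden LSTM units per time step
    layers.Dense(1, activation="sigmoid"),   # single output for binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()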
The data has 1 input variable and 1 output variable. 4 (hidden) LSTM units are chosen (for each time step/slice). The task is regression. Another example: https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/ 29
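A similar hedged sketch for the regression example (1 input variable, 4 LSTM units, 1 linear output, mean-squared-error loss); the sequence length of 3 is an illustrative assumption, not from the slide.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(3, 1)),               # 3 time steps x 1 input variable
    layers.LSTM(4),                          # 4 hidden LSTM units
    layers.Dense(1),                         # linear output for regression
])
model.compile(optimizer="adam", loss="mse")
model.summary()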