October 20, 2017
Overview
Supervised Learning: Feedforward neural network, Convolutional neural network, Recurrent neural network, Recursive neural network (recursive neural tensor network)
Unsupervised Learning: Autoencoder, Boltzmann machine (restricted Boltzmann machine)
Reinforcement Learning
Cartoon of ML
Fun Applications Figure: Instant visual translation (Google blog).
Fun Applications Poetry Machine (Q. Wang et al., 2016)
Fun Applications CartPole
Fun Applications Pong
Feedforward NN (Binary Classification) Suppose we have two features and one hidden layer with 3 units.
Feedforward NN (Forward Pass) For each training sample, $z_i = \tanh\big(\sum_j W_{ij} x_j + b_i\big)$, where $W_{ij}$ is the weight matrix and $b_i$ is the bias, and $\hat{p}(y=1 \mid x_1, x_2) = \sigma\big(\sum_i V_i z_i + c\big)$, where $\sigma$ is the sigmoid function, $V_i$ is the weight vector and $c$ is the bias.¹ In matrix form,
$$W = \begin{bmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}, \quad V = \begin{bmatrix} V_1 \\ V_2 \\ V_3 \end{bmatrix},$$
so that $z = \tanh(W^T x + b)$ and $\hat{p}(y=1 \mid x) = \sigma(V^T z + c)$.
¹ For multi-class classification, we use the softmax function and the weight matrix $V_{ik}$, with $k$ as the class label index.
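A minimal NumPy sketch of this forward pass; the random weights and the example input below are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, W, b, V, c):
    """Forward pass of the 2-feature, 3-hidden-unit network described above."""
    z = np.tanh(W.T @ x + b)    # hidden layer: z = tanh(W^T x + b), shape (3,)
    p_hat = sigmoid(V @ z + c)  # output: p(y = 1 | x) = sigma(V^T z + c), a scalar
    return z, p_hat

# Illustrative shapes: W is 2x3, b and V have length 3, c is a scalar.
rng = np.random.default_rng(0)
W, b, V, c = rng.normal(size=(2, 3)), rng.normal(size=3), rng.normal(size=3), 0.0
x = np.array([0.5, -1.2])       # one sample with two features
z, p_hat = forward(x, W, b, V, c)
```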
Feedforward NN (Forward Pass) Q: If we do not include the hidden layer, what is the neural network equivalent to?
Feedforward NN (Backpropagation) The cross-entropy loss is
$$J(W, b, V, c) = -\sum_{i=1}^{n} \big[\, y_i \ln \hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i) \,\big],$$
where $n$ is the number of samples. Let $u = V^T z + c$ and $v = W^T x + b$. Then
$$\frac{\partial J}{\partial c} = \sum_i \frac{\partial J}{\partial \hat{p}_i}\frac{\partial \hat{p}_i}{\partial u}\frac{\partial u}{\partial c} = \sum_i \frac{\partial J}{\partial \hat{p}_i}\,\sigma'(u),$$
$$\frac{\partial J}{\partial V_j} = \sum_i \frac{\partial J}{\partial \hat{p}_i}\frac{\partial \hat{p}_i}{\partial u}\frac{\partial u}{\partial V_j} = \sum_i \frac{\partial J}{\partial \hat{p}_i}\,\sigma'(u)\, z_j,$$
$$\frac{\partial J}{\partial b_j} = \sum_{i,k} \frac{\partial J}{\partial \hat{p}_i}\frac{\partial \hat{p}_i}{\partial u}\frac{\partial u}{\partial z_k}\frac{\partial z_k}{\partial b_j} = \sum_{i,k} \frac{\partial J}{\partial \hat{p}_i}\,\sigma'(u)\, V_k \frac{\partial z_k}{\partial b_j}.$$
Feedforward NN (Training) Algorithm 1: Training procedure.
Require: learning rate $\eta$; initial parameters (random initialization).
1: while not converged do
2:   Given the weights, compute the estimated output via a feedforward pass.
3:   Given the estimated output, compute the cost function and its gradient through backpropagation.
4:   Update: $\theta \leftarrow \theta - \eta \nabla_\theta J$.
5: end while
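A hedged NumPy sketch of this training loop for the one-hidden-layer network above; the synthetic data, learning rate, and fixed iteration budget are illustrative assumptions.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Illustrative synthetic data: n samples, 2 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Parameters of the 2 -> 3 -> 1 network, randomly initialized.
W, b = rng.normal(scale=0.1, size=(2, 3)), np.zeros(3)
V, c = rng.normal(scale=0.1, size=3), 0.0

eta = 0.5                                 # learning rate
for step in range(1000):                  # "while not converged" approximated by a fixed budget
    # Step 2: feedforward pass.
    Z = np.tanh(X @ W + b)                # (n, 3)
    p = sigmoid(Z @ V + c)                # (n,)

    # Step 3: gradients of the cross-entropy loss; dJ/du = p - y per sample.
    du = p - y
    dV, dc = Z.T @ du, du.sum()
    dZ = np.outer(du, V) * (1 - Z ** 2)   # chain rule through tanh
    dW, db = X.T @ dZ, dZ.sum(axis=0)

    # Step 4: gradient-descent update (averaged over the batch for a stable step size).
    n = len(X)
    W -= eta * dW / n; b -= eta * db / n
    V -= eta * dV / n; c -= eta * dc / n
```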
Recap of Feedforward NN Components of an FNN: a graph structure and weight matrices. Training: forward pass and backpropagation. Applications: (multi-class) classification and regression.
Feedforward NN (Playground) TensorFlow Playground
Applications in Recommender System
Wide & Deep Learning (Background) Wide and deep models (H.-T. Cheng et al., 2016; P. Covington et al., 2016) show outstanding performance in recommender systems (TensorFlow Dev Summit, 2017).
Illustration of Wide & Deep Learning
Scheme of Wide & Deep Learning
A Toy Example of Embedding
Advantages of Wide & Deep Models Jointly train wide & deep parts (in contrast to ensemble models). Scalability (batch training). Easy to handle large sparse features (embedding).
Application in Purchase Prediction
Problem Description Data set: 6.5 million samples with 170 raw features (2 days of US data in 2017). Target: whether a user will make a purchase. Binary classification problem.
Data Pipeline Parse the data from disk as batches during the training process (data are not loaded into physical memory). Continuous variables: real vectors → constant (dense) tensors. Categorical variables: vectors of strings → sparse tensors; sparse tensors hashed with a fixed bucket size (crossed sparse features); embedded into low dimensions.
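A sketch of how this pipeline might be expressed with TensorFlow 1.x feature columns; the column names (price, category, country) and the bucket/embedding sizes are hypothetical placeholders, not the actual 170 raw features.

```python
import tensorflow as tf

# Continuous variable -> dense numeric column.
price = tf.feature_column.numeric_column("price")

# Categorical variables -> sparse columns hashed into a fixed bucket size.
category = tf.feature_column.categorical_column_with_hash_bucket(
    "category", hash_bucket_size=1000)
country = tf.feature_column.categorical_column_with_hash_bucket(
    "country", hash_bucket_size=100)

# Crossed sparse feature for the wide part.
category_x_country = tf.feature_column.crossed_column(
    [category, country], hash_bucket_size=10000)

# Low-dimensional embeddings of the sparse columns for the deep part.
deep_columns = [
    price,
    tf.feature_column.embedding_column(category, dimension=10),
    tf.feature_column.embedding_column(country, dimension=10),
]
wide_columns = [category, country, category_x_country]
```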
Implementation Highlights Memory efficient: streaming batches in training and evaluation steps. Adaptive: pre-trained models can be restored and further trained on new incoming data. Models can be updated constantly. Hyper-parameters can be adjusted accordingly.
Training details The follow-the-regularized-leader (FTRL) algorithm (H. Brendan McMahan, 2011) is used to optimize the wide model and the Adam optimizer (Diederik P. Kingma et al., 2014) is used to optimize the deep model. Batch size is set to 100. Early stopping is inactive. Computational time: 2 days for 4 million iterations (about 80 epochs).
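A minimal sketch of wiring these optimizer choices into the TF 1.x canned wide & deep estimator; the feature columns here are placeholder stand-ins for the pipeline sketched earlier, the hidden layer sizes and learning rates follow the wide&deep1 row of the next table, and the training input function is assumed rather than shown.

```python
import tensorflow as tf

# Minimal placeholder columns (see the data-pipeline sketch above for more detail).
price = tf.feature_column.numeric_column("price")
category = tf.feature_column.categorical_column_with_hash_bucket("category", 1000)
wide_columns = [category]
deep_columns = [price, tf.feature_column.embedding_column(category, dimension=10)]

model = tf.estimator.DNNLinearCombinedClassifier(
    model_dir="/tmp/wide_and_deep",
    linear_feature_columns=wide_columns,
    linear_optimizer=tf.train.FtrlOptimizer(learning_rate=1e-3),  # wide part: FTRL
    dnn_feature_columns=deep_columns,
    dnn_optimizer=tf.train.AdamOptimizer(learning_rate=1e-4),     # deep part: Adam
    dnn_hidden_units=[200, 150, 100, 50],
)

# train_input_fn is assumed to stream batches of 100 parsed samples from disk.
# model.train(input_fn=train_input_fn, steps=4000000)
```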
Models
Models | D | h | lr | dp | sparse
wide1 | - | - | 10^-3 | - | No
deep1 | 10 | 150,50 | 10^-4 | 0 | -
deep2 | 10 | 200,150,100,50 | 10^-4 | 0 | -
wide&deep1 | 10 | 200,150,100,50 | 10^-3 / 10^-4 | 0 | No
Table: List of models evaluated in experiment 1. In this table, the embedding size is denoted D, hidden layer sizes h, learning rates lr, and dropout rates dp. The last column indicates whether sparse cross features are included.
Results
Models
Models | D | h | lr | dp | sparse
deep3 | 20 | 256,128 | 10^-4 | 0.2 | -
deep4 | 10 | 512,256,128 | 10^-4 | 0.5 | -
deep5 | 10 | 256,128 | 10^-4 | 0.2 | -
wide&deep2 | 10 | 256,128 | 2×10^-4 / 10^-4 | 0.2 | No
Table: List of models evaluated in experiment 2. We also include l1 and l2 regularization in the last model when optimizing the wide part.
Results Only three curves are shown; the wide&deep2 model obtains an AUC of 0.5 (stuck at a local minimum).
Models
Models | D | h | lr | dp | sparse
deep6 | 20 | 512,256,128 | 10^-4 | 0.5 | -
deep7 | 20 | 512,256,128 | 5×10^-5 | 0.5 | -
deep8 | 20 | 1024,512,256,128 | 2×10^-5 | 0.75 | -
wide&deep3 | 20 | 512,256,128 | 5×10^-4 / 5×10^-5 | 0.5 | Yes
Table: List of models evaluated in experiment 3.
Results
Tensorboard Tensorboard
Autoencoders Supervised machine learning models have the same API: train(X, Y) or fit(X, Y), then predict(X). What if we made the NN just predict itself, train(X, X)? That's an autoencoder! (auto = self)
Autoencoders (Illustration) $z = f(W^T x + b_h)$, $\hat{x} = f(W z + b_o)$. The objective is to minimize the reconstruction error: $J = \sum_{i=1}^{n} \|x_i - \hat{x}_i\|_2^2 = \|X - \hat{X}\|_F^2$. Figure: A toy autoencoder with one hidden layer. Q: Similar to any well-known unsupervised learning procedure?
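A minimal NumPy sketch of this one-hidden-layer autoencoder with tied weights; the shapes, the choice f = tanh, and the random data are illustrative assumptions.

```python
import numpy as np

def autoencode(X, W, b_h, b_o, f=np.tanh):
    """One-hidden-layer autoencoder with tied weights W, as in the slide."""
    Z = f(X @ W + b_h)          # encode:  z = f(W^T x + b_h)
    X_hat = f(Z @ W.T + b_o)    # decode:  x_hat = f(W z + b_o)
    return Z, X_hat

def reconstruction_error(X, X_hat):
    """Squared Frobenius norm of the reconstruction residual."""
    return np.sum((X - X_hat) ** 2)

# Illustrative example: 100 samples in R^20 compressed to R^5.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
W = rng.normal(scale=0.1, size=(20, 5))
b_h, b_o = np.zeros(5), np.zeros(20)
Z, X_hat = autoencode(X, W, b_h, b_o)
print(reconstruction_error(X, X_hat))
```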
Autoencoders as Nonlinear PCA PCA: $J_{\text{PCA}} = \|X - \hat{X}\|_F^2 = \|X - X Q Q^T\|_F^2$, where $\hat{X}$ is the low-rank approximation of $X$ ($x_i \in \mathbb{R}^p$, $X \in \mathbb{R}^{n \times p}$, $Q \in \mathbb{R}^{p \times d}$ with $d \ll p$). Autoencoder: $J_{\text{Auto}} = \|X - \hat{X}\|_F^2 = \|X - f(f(X W) W^T)\|_F^2$, where bias terms are neglected. If we take $f$ as the identity function, then autoencoders reduce to PCA.
Autoencoders (Visualization) Figure: Top: the architecture of a deep autoencoder with hidden layers of sizes 500, 300 and 2. Bottom (from left to right): visualizations of the MNIST (Tutorial) data set after 0, 20 and 500 epochs of training, respectively.
RNN (Applications) Each rectangle is a vector and arrows represent functions (e.g., matrix multiplication). Input vectors are in red, output vectors are in blue, and green vectors (hidden layers) hold the RNN's state (Andrej Karpathy et al., 2016, Tutorial).
RNNs (Applications) Figure: Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (Tutorial)
RNNs (Applications) Figure: Sequence input and sequence output (e.g. chatbot: an RNN reads a sentence and then outputs a corresponding sentence). (Tutorial)
RNN (Applications) In all scenarios, there are no pre-specified constraints on the sequence lengths, as the recurrent transformation can be applied as many times as desired.
RNNs (Structure) Figure: A recurrent neural network and the unfolding in time of the computation involved in its forward computation. (Yann LeCun, Yoshua Bengio & Geoffrey Hinton, 2015)
RNNs (Structure) $x_t$ is the input at time step $t$. For example, $x_t$ could be a one-hot vector corresponding to a word of a sentence. $y_t$ is the output at time step $t$. For example, if we wanted to predict the next word in a sentence, it would be a vector of probabilities across our vocabulary, namely $y_t = \mathrm{softmax}(W_o h_t)$. (Tutorial)
RNNs (Structure) $h_t$ is the hidden state at time step $t$; it is the memory of the network. $h_t$ is calculated from the previous hidden state and the input at the current step: $h_t = f(W_x x_t + W_h h_{t-1})$. The function $f$ is usually nonlinear (such as tanh or ReLU). The initial hidden state, which is required to calculate the first $h_t$, is typically initialized to all zeroes.
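A small NumPy sketch that unrolls these two equations over a toy sequence; the vocabulary size, hidden size, and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def rnn_forward(xs, W_x, W_h, W_o):
    """Unroll h_t = tanh(W_x x_t + W_h h_{t-1}), y_t = softmax(W_o h_t) over a sequence."""
    h = np.zeros(W_h.shape[0])        # initial hidden state: all zeroes
    ys = []
    for x_t in xs:                    # the same W_x, W_h, W_o are shared across all time steps
        h = np.tanh(W_x @ x_t + W_h @ h)
        ys.append(softmax(W_o @ h))
    return ys

# Illustrative sizes: vocabulary of 10 (one-hot inputs), hidden state of size 8.
V, H = 10, 8
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(H, V))
W_h = rng.normal(scale=0.1, size=(H, H))
W_o = rng.normal(scale=0.1, size=(V, H))
xs = np.eye(V)[[3, 1, 4]]             # a toy sequence of three one-hot "words"
ys = rnn_forward(xs, W_x, W_h, W_o)
```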
RNN An RNN shares the same parameters ($W_x$, $W_h$, $W_o$ above) across all time steps, which greatly reduces the total number of parameters. Training is performed via backpropagation through time (BPTT). RNN architectures can be extended to bidirectional RNNs, deep (bidirectional) RNNs, LSTM networks, etc.
Application in NLP
Word Embeddings (Motivations) If the total vocabulary size is $V$, then the one-hot encoding of each word is a vector in $\mathbb{R}^V$. Since $V$ can be very large, it is desirable to embed words into a much lower dimension $D$. (A word embedding, word $\to \mathbb{R}^D$, is a parameterized mapping.) With one-hot encodings, Euclidean distances are the same between any two different words, and there are no word analogies.
Word Embeddings (History) Learning distributed representations of different concepts (Hinton, 1986). Learning a representation for each word (significantly improved over tri-gram models) (Y. Bengio, 2003). Directly extract word analogies and relations (T. Mikolov et al., 2013, T. Mikolov et al., 2013, Tutorial).
Word Embeddings Model: RNN with GRU units. Goal: find word analogies and visualize word embeddings. Data set: part of the Wikipedia articles (100 files, 5.6 million sentences). Parameters: V = 2000, D = 30.
Word Analogies Figure: Word analogies for the third model: 8 words ordered according to the cosine similarity.
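A hedged sketch of how such a ranking by cosine similarity can be computed from a trained embedding matrix; the matrix E, the vocabulary dictionaries, and the king/man/woman example are hypothetical placeholders, not outputs of the model above.

```python
import numpy as np

def closest_words(query_vec, E, id_to_word, top_k=8):
    """Return the top_k words whose embeddings have the highest cosine similarity to query_vec."""
    sims = E @ query_vec / (np.linalg.norm(E, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [id_to_word[i] for i in np.argsort(-sims)[:top_k]]

def analogy(a, b, c, E, word_to_id, id_to_word):
    """Solve 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    v = E[word_to_id[b]] - E[word_to_id[a]] + E[word_to_id[c]]
    return closest_words(v, E, id_to_word)

# Hypothetical usage with a trained V x D embedding matrix E (V = 2000, D = 30):
# analogy("man", "king", "woman", E, word_to_id, id_to_word)
```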
Word Embeddings (Visualization) Figure: Word embeddings via t-distributed stochastic neighbor embedding (t-SNE).
Application in e-commerce
Model Illustration Figure: A toy RNN model with one hidden layer and one input feature.
Data Preparation Time interval: 2017/05/09, 4:00 AM to 2017/05/10, 4:00 AM (PST). Sample size: 100,000 users (selected randomly). Maximum sequence length: 300; longer sequences are truncated² (restricted by computational time). ² Analyses with a fixed sequence length are highly biased.
Training details Hardware: desktop (24 cores, 32 GB RAM, 1080Ti GPU); deap-dsci1.phx01, peap-dsci3.phx01 (48 cores, 128 GB RAM). TensorFlow version: r1.1. Computational time: 9.8 minutes per epoch; 5 hours (30 epochs) to 3 days (450 epochs).
Training details Optimizer: Adam (Diederik P. Kingma et al., 2014). Model architecture: LSTMs (S. Hochreiter et al., 1997) or GRUs (Cho et al., 2014); 1-3 hidden layers; hidden layer size 50 or 100. Batch size: 100. Early stopping: active, based on validation AUC. Data split: 0.7/0.1/0.2 for train/validation/test. Dropout: keep probability 1.0, 0.8, or 0.6 (W. Zaremba et al., 2015).
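A rough TensorFlow 1.x sketch matching these choices (a single LSTM layer with dropout, trained with Adam); the placeholder shapes, feature count, hidden size, and learning rate are illustrative assumptions, and exact module paths vary slightly across 1.x releases.

```python
import tensorflow as tf

n_features, n_hidden, keep_prob = 1, 100, 0.8               # illustrative sizes

# Batch of variable-length sequences, padded to a common length.
x = tf.placeholder(tf.float32, [None, None, n_features])     # (batch, time, features)
seq_len = tf.placeholder(tf.int32, [None])                   # true lengths before padding
y = tf.placeholder(tf.float32, [None])                        # binary purchase label

cell = tf.nn.rnn_cell.LSTMCell(n_hidden)
cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)
_, state = tf.nn.dynamic_rnn(cell, x, sequence_length=seq_len, dtype=tf.float32)

logits = tf.squeeze(tf.layers.dense(state.h, 1), axis=1)      # read out from the last hidden state
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```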
RNNs vs. Baseline Models Models/Features rendered +paid +feature3 Reg(pad 0) 0.590 0.599 0.717 Reg(pad 1) 0.696 0.704 0.720 RNNs 0.714 0.722 0.797 Table: All baseline models (denoted as Reg) are l 2 -penalized logistic regressions. Five fold cross-validations are applied to tune the penalty parameter (maximizing the CV auc). Test auc are reported in this table.
Results (prob), ads serving Figure: Predicted purchase probability with is_rendered as an input feature.