Deep Learning Applications


October 20, 2017

Overview. Supervised Learning: feedforward neural network, convolutional neural network, recurrent neural network, recursive neural network (recursive neural tensor network). Unsupervised Learning: autoencoder, Boltzmann machine (restricted Boltzmann machine). Reinforcement Learning.

Cartoon of ML

Fun Applications Figure: Instant visual translation (Google blog).

Fun Applications Poetry Machine (Q. Wang et al., 2016)

Fun Applications CartPole

Fun Applications Pong

Feedforward NN (Binary Classification) Suppose we have two features and one hidden layer with 3 units.

Feedforward NN (Forward Pass) For each training sample, $z_i = \tanh\big(\sum_j W_{ij} x_j + b_i\big)$, where $W_{ij}$ is the weight matrix and $b_i$ is the bias; $\hat p(y = 1 \mid x_1, x_2) = \sigma\big(\sum_i V_i z_i + c\big)$, where $\sigma$ is the sigmoid function, $V_i$ is the weight vector and $c$ is the bias. In matrix form,
$$W = \begin{bmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}, \quad V = \begin{bmatrix} V_1 \\ V_2 \\ V_3 \end{bmatrix},$$
$$z = \tanh(W^T x + b), \qquad \hat p(y = 1 \mid x) = \sigma(V^T z + c).$$
For multi-class classification, we use the softmax function and a weight matrix $V_{ik}$, with $k$ the class-label index.
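To make the forward pass concrete, here is a minimal NumPy sketch of the 2-feature, 3-hidden-unit network above. The shapes and variable names mirror the slide; the random initialization and the example input are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes mirror the slide: 2 input features, 3 hidden units, 1 output.
W = rng.normal(scale=0.1, size=(2, 3))   # input-to-hidden weights
b = np.zeros(3)                          # hidden biases
V = rng.normal(scale=0.1, size=3)        # hidden-to-output weights
c = 0.0                                  # output bias

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x):
    """Forward pass: x -> hidden activations z -> p(y=1 | x)."""
    z = np.tanh(W.T @ x + b)             # z = tanh(W^T x + b)
    p_hat = sigmoid(V @ z + c)           # p(y=1|x) = sigma(V^T z + c)
    return z, p_hat

z, p_hat = forward(np.array([0.5, -1.2]))
print(z, p_hat)
```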

Feedforward NN (Forward Pass) Q: If we do not include the hidden layer, what is the neural network equivalent to?

Feedforward NN (Backpropagation) The cross-entropy loss is
$$J(W, b, V, c) = -\sum_{i=1}^{n} \big[ y_i \ln \hat p_i + (1 - y_i) \ln(1 - \hat p_i) \big],$$
where $n$ is the number of samples. Let $u = V^T z + c$ and $v = W^T x + b$. Then
$$\frac{\partial J}{\partial c} = \sum_i \frac{\partial J}{\partial \hat p_i}\,\frac{\partial \hat p_i}{\partial u}\,\frac{\partial u}{\partial c} = \sum_i \frac{\partial J}{\partial \hat p_i}\,\sigma'(u),$$
$$\frac{\partial J}{\partial V_j} = \sum_i \frac{\partial J}{\partial \hat p_i}\,\frac{\partial \hat p_i}{\partial u}\,\frac{\partial u}{\partial V_j} = \sum_i \frac{\partial J}{\partial \hat p_i}\,\sigma'(u)\,z_j,$$
$$\frac{\partial J}{\partial b_j} = \sum_{i,k} \frac{\partial J}{\partial \hat p_i}\,\frac{\partial \hat p_i}{\partial u}\,\frac{\partial u}{\partial z_k}\,\frac{\partial z_k}{\partial b_j} = \sum_{i,k} \frac{\partial J}{\partial \hat p_i}\,\sigma'(u)\,V_k\,\frac{\partial z_k}{\partial b_j}.$$

Feedforward NN (Training) Algorithm 1: Training procedure. Require: learning rate η; initial parameters (random initialization). While not converged: (1) given the weights, compute the estimated output via a feedforward pass; (2) given the estimated output, compute the cost function and the gradient through backpropagation; (3) update θ ← θ − η ∇_θ J.
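The following NumPy sketch puts the whole procedure together for the toy 2-3-1 network: forward pass, backpropagation of the cross-entropy gradients derived above, and the gradient-descent update. The synthetic data, learning rate, and epoch count are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: 200 samples, 2 features, binary labels.
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)    # a simple nonlinear target

W = rng.normal(scale=0.1, size=(2, 3)); b = np.zeros(3)
V = rng.normal(scale=0.1, size=3);      c = 0.0
eta = 0.1                                    # learning rate

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

for epoch in range(500):                     # "while not converged" in Algorithm 1
    # Forward pass
    Z = np.tanh(X @ W + b)                   # (n, 3) hidden activations
    p = sigmoid(Z @ V + c)                   # (n,)  predicted probabilities
    p = np.clip(p, 1e-7, 1 - 1e-7)           # numerical safety for the log
    # Cross-entropy loss J = -sum[y ln p + (1-y) ln(1-p)]
    J = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Backpropagation
    du = p - y                               # dJ/du for the sigmoid output
    grad_V = Z.T @ du
    grad_c = du.sum()
    dv = (du[:, None] * V) * (1 - Z**2)      # back through tanh
    grad_W = X.T @ dv
    grad_b = dv.sum(axis=0)
    # Gradient step: theta <- theta - eta * grad (mean gradient over the batch)
    W -= eta * grad_W / len(X); b -= eta * grad_b / len(X)
    V -= eta * grad_V / len(X); c -= eta * grad_c / len(X)

print("final loss:", J)
```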

Recap of Feedforward NN Components of an FNN: a graph structure and weight matrices. Training: forward pass and backpropagation. Applications: (multi-class) classification and regression.

Feedforward NN (Playground): TensorFlow Playground.

Applications in Recommender System

Wide & Deep Learning (Background) Wide and deep models (H.-T. Cheng et al., 2016; Paul Covington et al., 2016) have shown outstanding performance in recommender systems (TensorFlow Dev Summit, 2017).

Illustration of Wide & Deep Learning

Scheme of Wide & Deep Learning

A Toy Example of Embedding

Advantages of Wide & Deep Models Jointly train wide & deep parts (in contrast to ensemble models). Scalability (batch training). Easy to handle large sparse features (embedding).

Application in Purchase Prediction

Problem Description Data set: 6.5 million samples with 170 raw features (2 days of US data in 2017). Target: whether a user will make a purchase. This is a binary classification problem.

Data Pipeline The data are parsed from disk in batches during training (data are not loaded into physical memory). Continuous variables: real vectors → constant (dense) tensors. Categorical variables: vectors of strings → sparse tensors → sparse tensors hashed with a fixed bucket size (crossed sparse features) → embedded into low dimensions.
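As an illustration of this pipeline, the sketch below builds dense, hashed, crossed, and embedded columns with TensorFlow's tf.feature_column API (the estimator-era API, still available though deprecated in recent releases). The column names and bucket/embedding sizes are hypothetical; the real pipeline has around 170 raw features.

```python
import tensorflow as tf

# Hypothetical column names; the real pipeline has ~170 raw features.
price = tf.feature_column.numeric_column("price")            # continuous -> dense tensor
country = tf.feature_column.categorical_column_with_hash_bucket(
    "country", hash_bucket_size=1000)                        # string -> hashed sparse tensor
device = tf.feature_column.categorical_column_with_hash_bucket(
    "device", hash_bucket_size=100)
country_x_device = tf.feature_column.crossed_column(
    ["country", "device"], hash_bucket_size=10000)           # crossed sparse feature

wide_columns = [country, device, country_x_device]           # fed to the linear (wide) part
deep_columns = [
    price,
    tf.feature_column.embedding_column(country, dimension=10),  # low-dimensional embeddings
    tf.feature_column.embedding_column(device, dimension=10),
]
```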

Implementation Highlights Memory efficient: streaming batches in training and evaluation steps. Adaptive: pre-trained models can be restored and further trained on new incoming data. Models can be updated constantly. Hyper-parameters can be adjusted accordingly.

Training details The follow-the-regularized-leader (FTRL) algorithm (H. Brendan McMahan, 2011) is used to optimize the wide model and Adam (Diederik P. Kingma et al., 2014) is used to optimize the deep model. Batch size is set to 100. Early stopping is disabled. Computational time: 2 days for 4 million iterations (about 80 epochs).
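A hedged sketch of how these training choices map onto TensorFlow's canned wide & deep estimator. It reuses the hypothetical wide_columns/deep_columns from the previous sketch, assumes a train_input_fn that streams batches of 100 from disk (not shown), and is written against the tf.estimator API (removed in the newest TF releases); it is not the authors' exact code.

```python
import tensorflow as tf

# Assumes wide_columns / deep_columns as in the previous sketch.
model = tf.estimator.DNNLinearCombinedClassifier(
    model_dir="/tmp/wide_deep",            # checkpoints allow restoring / continued training
    linear_feature_columns=wide_columns,
    linear_optimizer=tf.keras.optimizers.Ftrl(learning_rate=1e-3),   # FTRL for the wide part
    dnn_feature_columns=deep_columns,
    dnn_optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),      # Adam for the deep part
    dnn_hidden_units=[200, 150, 100, 50],
)

# train_input_fn would stream batches of size 100 from disk (hypothetical, not shown):
# model.train(input_fn=train_input_fn, steps=4_000_000)
```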

Models (Experiment 1)
Model      | D  | h              | lr            | dp | sparse
wide1      | -  | -              | 10^-3         | -  | No
deep1      | 10 | 150,50         | 10^-4         | 0  | -
deep2      | 10 | 200,150,100,50 | 10^-4         | 0  | -
wide&deep1 | 10 | 200,150,100,50 | 10^-3 / 10^-4 | 0  | No
Table: List of models evaluated in experiment 1. D is the embedding size, h the hidden layer sizes, lr the learning rate(s) (wide/deep for the combined model), dp the dropout rate; the last column indicates whether sparse cross features are included.

Results

Models (Experiment 2)
Model      | D  | h           | lr              | dp  | sparse
deep3      | 20 | 256,128     | 10^-4           | 0.2 | -
deep4      | 10 | 512,256,128 | 10^-4           | 0.5 | -
deep5      | 10 | 256,128     | 10^-4           | 0.2 | -
wide&deep2 | 10 | 256,128     | 2×10^-4 / 10^-4 | 0.2 | No
Table: List of models evaluated in experiment 2. We also include l1 and l2 regularization in the last model when optimizing the wide part.

Results Only three curves are shown; the wide&deep2 model obtains an AUC of about 0.5 (stuck at a local minimum).

Models (Experiment 3)
Model      | D  | h                | lr                | dp   | sparse
deep6      | 20 | 512,256,128      | 10^-4             | 0.5  | -
deep7      | 20 | 512,256,128      | 5×10^-5           | 0.5  | -
deep8      | 20 | 1024,512,256,128 | 2×10^-5           | 0.75 | -
wide&deep3 | 20 | 512,256,128      | 5×10^-4 / 5×10^-5 | 0.5  | Yes
Table: List of models evaluated in experiment 3.

Results

TensorBoard

Autoencoders Supervised machine learning models share the same API: train(X, Y) or fit(X, Y), then predict(X). What if we made the NN just predict itself, train(X, X)? That's an autoencoder! (auto = self)

Autoencoders Illustration $z = f(W^T x + b_h)$, $\hat x = f(W z + b_o)$. The objective is to minimize the reconstruction error: $J = \sum_{i=1}^{n} \|x_i - \hat x_i\|_2^2 = \|X - \hat X\|_F^2$. Figure: A toy autoencoder with one hidden layer. Q: Similar to any well-known unsupervised learning procedure?
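A minimal tf.keras sketch of a one-hidden-layer autoencoder trained with "train(X, X)". It is written against today's tf.keras API rather than the graph-mode TensorFlow used in the talk; the layer sizes, activation, and placeholder data are illustrative, and unlike the slide's tied-weight illustration the decoder here has its own weight matrix.

```python
import numpy as np
import tensorflow as tf

input_dim, hidden_dim = 784, 32           # illustrative sizes (e.g. flattened 28x28 images)

inputs = tf.keras.Input(shape=(input_dim,))
z = tf.keras.layers.Dense(hidden_dim, activation="tanh")(inputs)  # z = f(W^T x + b_h)
x_hat = tf.keras.layers.Dense(input_dim)(z)                       # x_hat = W' z + b_o (untied weights)
autoencoder = tf.keras.Model(inputs, x_hat)

# Reconstruction error ||X - X_hat||_F^2 corresponds to an MSE loss (up to scaling).
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, input_dim).astype("float32")             # placeholder data
autoencoder.fit(X, X, epochs=5, batch_size=100, verbose=0)        # "train(X, X)"
```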

Autoencoders as Nonlinear PCA PCA: $J_{\mathrm{PCA}} = \|X - \hat X\|_F^2 = \|X - X Q Q^T\|_F^2$, where $\hat X$ is the low-rank approximation of $X$ ($x_i \in \mathbb{R}^p$, $X$ is $n \times p$, $Q$ is $p \times d$ with $d < p$). Autoencoder: $J_{\mathrm{Auto}} = \|X - \hat X\|_F^2 = \|X - f(f(X W) W^T)\|_F^2$, where bias terms are neglected. If we take $f$ to be the identity function, the autoencoder reduces to PCA.
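To make the PCA side of this comparison concrete, here is a short NumPy sketch that computes the rank-d reconstruction $\hat X = X Q Q^T$ from the top principal directions and evaluates the Frobenius reconstruction error; the data and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 500, 20, 2
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                     # center the data

# Top-d principal directions Q (p x d) from the SVD of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Q = Vt[:d].T

X_hat = X @ Q @ Q.T                        # low-rank reconstruction X Q Q^T
J_pca = np.linalg.norm(X - X_hat, "fro") ** 2
print("PCA reconstruction error:", J_pca)
```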

Autoencoders (Visualization) Figure: Top: the architecture of a deep autoencoder with hidden layers of sizes 500, 300 and 2. Bottom (from left to right): visualizations of the MNIST (Tutorial) data set after 0, 20 and 500 epochs of training, respectively.

RNN (Applications) Each rectangle is a vector and arrows represent functions (e.g. matrix multiplication). Input vectors are in red, output vectors are in blue, and green vectors (the hidden layer) hold the RNN's state (Andrej Karpathy et al., 2016, Tutorial).

RNNs (Applications) Figure: Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (Tutorial)

RNNs (Applications) Figure: Sequence input and sequence output (e.g. a chatbot: an RNN reads a sentence and then outputs a corresponding response sentence). (Tutorial)

RNN (Applications) In all scenarios, there are no pre-specified constraints on the sequence lengths, as the recurrent transformation can be applied as many times as desired.

RNNs (Structure) Figure: A recurrent neural network and the unfolding in time of the computation involved in its forward computation. (Yann LeCun, Yoshua Bengio & Geoffrey Hinton, 2015)

RNNs (Structure) $x_t$ is the input at time step $t$. For example, $x_t$ could be a one-hot vector corresponding to a word of a sentence. $y_t$ is the output at time step $t$. For example, if we wanted to predict the next word in a sentence, it would be a vector of probabilities across our vocabulary, namely $y_t = \mathrm{softmax}(W_o h_t)$. (Tutorial)

RNNs (Structure) $h_t$ is the hidden state at time step $t$; it is the memory of the network. $h_t$ is calculated from the previous hidden state and the input at the current step: $h_t = f(W_x x_t + W_h h_{t-1})$. The function $f$ is usually nonlinear (such as tanh or ReLU). $h_{-1}$, which is required to calculate the first hidden state, is typically initialized to all zeros.
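A minimal NumPy sketch of this recurrence for a toy vocabulary; the dimensions and random initialization are illustrative, and biases are omitted as on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4            # illustrative sizes

W_x = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input-to-hidden
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden (shared over time)
W_o = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden-to-output

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_forward(inputs):
    """inputs: list of one-hot vectors x_t; returns hidden states and output distributions."""
    h = np.zeros(hidden_size)              # h_{-1} initialized to all zeros
    hs, ys = [], []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h)   # h_t = f(W_x x_t + W_h h_{t-1})
        y_t = softmax(W_o @ h)             # y_t = softmax(W_o h_t)
        hs.append(h); ys.append(y_t)
    return hs, ys

# Example: a length-3 sequence of one-hot word vectors.
seq = [np.eye(vocab_size)[i] for i in (2, 7, 4)]
hs, ys = rnn_forward(seq)
print(ys[-1])                              # predicted distribution over the next word
```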

RNN An RNN shares the same parameters ($W_x$, $W_h$, $W_o$ above) across all time steps, which greatly reduces the total number of parameters. Training is performed via backpropagation through time (BPTT). RNN architectures can be extended to bidirectional RNNs, deep (bidirectional) RNNs, LSTM networks, etc.

Application in NLP

Word Embeddings (Motivations) If the total vocabulary size is $V$, the one-hot encoding of each word is a vector in $\mathbb{R}^V$. Since $V$ can be very large, it is desirable to embed words into a much lower dimension $D$. (A word embedding, word $\mapsto \mathbb{R}^D$, is a parameterized mapping.) With one-hot encodings, the Euclidean distance is the same between any two different words, and there are no word analogies.

Word Embeddings (History) Learning distributed representations of different concepts (Hinton, 1986). Learning a representation for each word, significantly improving over tri-gram models (Y. Bengio, 2003). Directly extracting word analogies and relations (T. Mikolov et al., 2013; T. Mikolov et al., 2013; Tutorial).

Word Embeddings Model: RNN with GRU units. Goal: find word analogies and visualize word embeddings. Data set: part of the Wikipedia articles (100 files, 5.6 million sentences). Parameters: $V = 2000$, $D = 30$.

Word Analogies Figure: Word analogies for the third model: 8 words ordered according to the cosine similarity.
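A sketch of how such a nearest-neighbor analogy query can be answered from a trained embedding matrix using cosine similarity. The embedding matrix E and the vocabulary below are hypothetical stand-ins for the model trained above (V = 2000, D = 30); with a real trained matrix, a query like analogy("man", "king", "woman") would ideally rank "queen" near the top.

```python
import numpy as np

V, D = 2000, 30
E = np.random.randn(V, D)                  # stand-in for the learned embedding matrix
vocab = [f"word{i}" for i in range(V)]     # stand-in vocabulary
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = dict(enumerate(vocab))

def analogy(a, b, c, k=8):
    """Return the k words closest (by cosine similarity) to vec(b) - vec(a) + vec(c)."""
    query = E[word2idx[b]] - E[word2idx[a]] + E[word2idx[c]]
    sims = E @ query / (np.linalg.norm(E, axis=1) * np.linalg.norm(query) + 1e-8)
    best = np.argsort(-sims)                                  # indices sorted by similarity
    result = [idx2word[i] for i in best if idx2word[i] not in {a, b, c}]
    return result[:k]

print(analogy("word1", "word2", "word3"))
```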

Word Embeddings (Visualization) Figure: Word embeddings visualized via t-distributed stochastic neighbor embedding (t-SNE).

Application in e-commerce

Model Illustration Figure: A toy RNN model with one hidden layer and one input feature.

Data Preparation Time interval: 2017/05/09 4:00 AM to 2017/05/10 4:00 AM (PST). Sample size: 100,000 users (selected randomly). Maximum sequence length: 300; longer sequences are truncated (restricted by computational time). Note: analyses with a fixed sequence length are highly biased.
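A short sketch of this truncation/padding step under stated assumptions: per-user event sequences of varying length with a single numeric feature per event, truncation keeping the most recent events, and zero-padding for shorter sequences. The function name and example data are illustrative.

```python
import numpy as np

MAX_LEN = 300   # maximum sequence length used in the talk

def truncate_and_pad(sequences, max_len=MAX_LEN):
    """Truncate sequences longer than max_len (keeping the most recent events)
    and zero-pad shorter ones; also return the true lengths for masking."""
    out = np.zeros((len(sequences), max_len), dtype=np.float32)
    lengths = np.zeros(len(sequences), dtype=np.int32)
    for i, seq in enumerate(sequences):
        seq = seq[-max_len:]               # keep the last max_len events
        out[i, :len(seq)] = seq
        lengths[i] = len(seq)
    return out, lengths

batch, lengths = truncate_and_pad([[0.1, 0.5, 0.2], [0.3] * 400])
print(batch.shape, lengths)                # (2, 300) [  3 300]
```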

Training details Hardware: desktop (24 cores, 32 GB RAM, 1080Ti GPU); deap-dsci1.phx01, peap-dsci3.phx01 (48 cores, 128 GB RAM). TensorFlow version: r1.1. Computational time: 9.8 minutes per epoch; 5 hours (30 epochs) to 3 days (450 epochs).

Training details Optimizer: Adam (Diederik P. Kingma et al., 2014). Model architecture: LSTMs (S. Hochreiter et al., 1997) and GRUs (Cho et al., 2014); 1 to 3 hidden layers; hidden layer size 50 or 100. Batch size: 100. Early stopping: active, based on validation AUC. Data split: 0.7/0.1/0.2 for train/validation/test. Dropout: keep probability 1.0, 0.8 or 0.6 (W. Zaremba et al., 2015).
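A hedged tf.keras sketch of one configuration from this grid: a 2-layer LSTM with hidden size 100, dropout rate 0.2 (i.e. keep probability 0.8), Adam, binary cross-entropy, and early stopping on validation AUC. It is written against the current tf.keras API rather than the TF r1.1 graph code used for the experiments; the feature dimension and patience are illustrative, and X_train/X_val are assumed padded arrays as produced by the previous sketch.

```python
import tensorflow as tf

max_len, n_features = 300, 1                     # padded sequence length, features per event

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(max_len, n_features)),  # ignore padding
    tf.keras.layers.LSTM(100, return_sequences=True, dropout=0.2),
    tf.keras.layers.LSTM(100, dropout=0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),            # purchase probability
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_auc", mode="max",
                                              patience=3, restore_best_weights=True)

# Hypothetical padded arrays X_train, X_val and labels y_train, y_val (not shown):
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=100, epochs=30, callbacks=[early_stop])
```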

RNNs vs. Baseline Models
Models / Features | rendered | +paid | +feature3
Reg (pad 0)       | 0.590    | 0.599 | 0.717
Reg (pad 1)       | 0.696    | 0.704 | 0.720
RNNs              | 0.714    | 0.722 | 0.797
Table: All baseline models (denoted Reg) are l2-penalized logistic regressions. Five-fold cross-validation is used to tune the penalty parameter (maximizing the CV AUC). Test AUC is reported in this table.

Results (prob): ads serving. Figure: Predicted purchase probability with "is rendered" as an input feature.