A Simple (?) Exercise: Predicting the Next Word

CS11-747 Neural Networks for NLP: A Simple (?) Exercise: Predicting the Next Word. Graham Neubig. Site: https://phontron.com/class/nn4nlp2017/

Are These Sentences OK? Jane went to the store. store to Jane went the. Jane went store. Jane goed to the store. The store went to Jane. The food truck went to Jane.

Calculating the Probability of a Sentence: $P(X) = \prod_{i=1}^{I} P(x_i \mid x_1, \ldots, x_{i-1})$ (the probability of each next word given its context). The big problem: how do we predict $P(x_i \mid x_1, \ldots, x_{i-1})$?!?!

Review: Count-based Language Models

Count-based Language Models. Count up the frequency and divide: $P_{ML}(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) := \frac{c(x_{i-n+1}, \ldots, x_i)}{c(x_{i-n+1}, \ldots, x_{i-1})}$. Add smoothing to deal with zero counts: $P(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) = \lambda P_{ML}(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) + (1-\lambda) P(x_i \mid x_{i-n+2}, \ldots, x_{i-1})$, e.g. Modified Kneser-Ney smoothing.
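A minimal sketch of the count-and-divide estimate with simple interpolation smoothing (not Kneser-Ney); the toy corpus, the bigram order, and the interpolation weight lam are illustrative assumptions:

from collections import Counter

def train_bigram_lm(sentences, lam=0.9):
    # Maximum-likelihood bigram and unigram estimates, combined by interpolation.
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total = sum(unigrams.values())

    def prob(word, prev):
        # P_ML(word | prev) = c(prev, word) / c(prev), interpolated with the unigram distribution.
        p_ml = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
        p_uni = unigrams[word] / total
        return lam * p_ml + (1 - lam) * p_uni

    return prob

prob = train_bigram_lm([["jane", "went", "to", "the", "store"]])
print(prob("store", "the"))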

A Refresher on Evaluation. Log-likelihood: $LL(\mathcal{E}_{test}) = \sum_{E \in \mathcal{E}_{test}} \log P(E)$. Per-word log likelihood: $WLL(\mathcal{E}_{test}) = \frac{1}{\sum_{E \in \mathcal{E}_{test}} |E|} \sum_{E \in \mathcal{E}_{test}} \log P(E)$. Per-word (cross) entropy: $H(\mathcal{E}_{test}) = \frac{1}{\sum_{E \in \mathcal{E}_{test}} |E|} \sum_{E \in \mathcal{E}_{test}} -\log_2 P(E)$. Perplexity: $ppl(\mathcal{E}_{test}) = 2^{H(\mathcal{E}_{test})} = e^{-WLL(\mathcal{E}_{test})}$.
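A minimal sketch of how these quantities relate, assuming we already have the natural-log probability and the length of each test sentence (the numbers are illustrative):

import math

sent_logprobs = [-12.4, -8.1, -20.3]   # assumed log P(E) for each test sentence
sent_lengths = [5, 3, 8]               # assumed length |E| of each test sentence

total_words = sum(sent_lengths)
ll = sum(sent_logprobs)                                                 # log-likelihood
wll = ll / total_words                                                  # per-word log likelihood
entropy = -sum(lp / math.log(2) for lp in sent_logprobs) / total_words  # per-word entropy in bits
ppl = 2 ** entropy                                                      # perplexity, equal to exp(-wll)

print(f"LL={ll:.2f} WLL={wll:.3f} H={entropy:.3f} ppl={ppl:.2f}")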

What Can we Do w/ LMs? Score sentences: "Jane went to the store." → high; "store to Jane went the." → low (same as calculating the loss for training). Generate sentences: while we didn't choose the end-of-sentence symbol, calculate the probability distribution and sample a new word from it.
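A minimal sketch of that generation loop, assuming a prob_dist(context) function that returns next-word probabilities over the vocabulary (both the function and the vocabulary here are illustrative stand-ins):

import random

EOS = "</s>"
vocab = ["jane", "went", "to", "the", "store", ".", EOS]

def prob_dist(context):
    # Stand-in for a real language model: a uniform distribution over the vocabulary.
    return [1.0 / len(vocab)] * len(vocab)

def generate(max_len=20):
    sent = ["<s>"]
    while sent[-1] != EOS and len(sent) < max_len:
        probs = prob_dist(sent)                               # calculate the probability distribution
        sent.append(random.choices(vocab, weights=probs)[0])  # sample a new word from it
    return sent[1:]

print(generate())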

Problems and Solutions? Cannot share strength among similar words ("she bought a car", "she purchased a car", "she bought a bicycle", "she purchased a bicycle") → solution: class-based language models. Cannot condition on context with intervening words ("Dr. Jane Smith", "Dr. Gertrude Smith") → solution: skip-gram language models. Cannot handle long-distance dependencies ("for tennis class he wanted to buy his own racquet", "for programming class he wanted to buy his own computer") → solution: cache, trigger, topic, syntactic models, etc.

An Alternative: Featurized Log-Linear Models

An Alternative: Featurized Models. Calculate features of the context; based on the features, calculate probabilities; optimize the feature weights using gradient descent, etc.

Example: the previous words are "giving a". Each word we're predicting gets a bias score b (how likely is it overall?), a score w_{1,a} (how likely is it given the previous word is "a"?), and a score w_{2,giving} (how likely is it given the second-previous word is "giving"?); the total score s is their sum:
a: b=3.0, w_{1,a}=-6.0, w_{2,giving}=-0.2, s=-3.2
the: b=2.5, w_{1,a}=-5.1, w_{2,giving}=-0.3, s=-2.9
talk: b=-0.2, w_{1,a}=0.2, w_{2,giving}=1.0, s=1.0
gift: b=0.1, w_{1,a}=0.1, w_{2,giving}=2.0, s=2.2
hat: b=1.2, w_{1,a}=0.5, w_{2,giving}=-1.2, s=0.6

Softmax. Convert scores into probabilities by taking the exponent and normalizing (softmax): $P(x_i \mid x_{i-n+1}^{i-1}) = \frac{e^{s(x_i \mid x_{i-n+1}^{i-1})}}{\sum_{\tilde{x}_i} e^{s(\tilde{x}_i \mid x_{i-n+1}^{i-1})}}$. In the example: s = [-3.2, -2.9, 1.0, 2.2, 0.6], p = [0.002, 0.003, 0.329, 0.444, 0.090].
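A minimal numpy sketch of the featurized log-linear score plus softmax; the vocabulary and weight values are random stand-ins, not the numbers from the example above:

import numpy as np

vocab = ["a", "the", "talk", "gift", "hat"]
V = len(vocab)

# One weight vector over the output vocabulary per feature of the context.
b = np.random.randn(V)                 # bias: how likely is each word overall?
w1 = {"a": np.random.randn(V)}         # weights given the previous word
w2 = {"giving": np.random.randn(V)}    # weights given the second-previous word

def softmax(s):
    e = np.exp(s - s.max())            # subtract the max for numerical stability
    return e / e.sum()

s = b + w1["a"] + w2["giving"]         # total score: sum of the feature weight vectors
p = softmax(s)                         # exponentiate and normalize
print(dict(zip(vocab, p.round(3))))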

A Computation Graph View: lookup2(giving) + lookup1(a) + bias = scores, then softmax gives probs. Each vector is the size of the output vocabulary.

A Note: Lookup. Lookup can be viewed as grabbing a single vector (of length vector-size) from a big num.-words × vector-size matrix of word embeddings, e.g. lookup(2) grabs row 2. Similarly, it can be viewed as multiplying that matrix by a one-hot vector such as [0, 0, 1, 0, 0]. The former tends to be faster.
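A minimal numpy sketch of the two equivalent views (the embedding matrix is randomly initialized for illustration):

import numpy as np

num_words, vec_size = 5, 3
E = np.random.randn(num_words, vec_size)   # embedding matrix: one row per word
word_id = 2

# View 1: grab a single row from the matrix (fast).
v_lookup = E[word_id]

# View 2: multiply by a one-hot vector (same result, more work).
one_hot = np.zeros(num_words)
one_hot[word_id] = 1.0
v_onehot = one_hot @ E

assert np.allclose(v_lookup, v_onehot)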

Training a Model. Reminder: to train, we calculate a loss function (a measure of how bad our predictions are) and move the parameters to reduce the loss. The most common loss function for probabilistic models is the negative log likelihood. For example, with p = [0.002, 0.003, 0.329, 0.444, 0.090], if element 3 (or, zero-indexed, 2) is the correct answer, the loss is $-\log 0.329 = 1.112$.
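A minimal sketch of the negative log likelihood loss for that example:

import math

p = [0.002, 0.003, 0.329, 0.444, 0.090]   # predicted distribution over the vocabulary
correct = 2                               # zero-indexed correct answer

loss = -math.log(p[correct])              # negative log likelihood
print(round(loss, 3))                     # ~1.112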

Parameter Update. Back propagation allows us to calculate the derivative of the loss with respect to the parameters, $\frac{\partial \ell}{\partial \theta}$. Simple stochastic gradient descent optimizes the parameters according to the rule $\theta \leftarrow \theta - \eta \frac{\partial \ell}{\partial \theta}$, where $\eta$ is the learning rate.
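A minimal numpy sketch of this vanilla SGD rule (the learning rate and values are illustrative assumptions):

import numpy as np

def sgd_update(theta, grad, lr=0.1):
    # One SGD step: theta <- theta - lr * dloss/dtheta
    return theta - lr * grad

theta = np.array([0.5, -1.0])
grad = np.array([0.2, -0.4])      # gradient of the loss w.r.t. theta (from backprop)
theta = sgd_update(theta, grad)
print(theta)                      # [ 0.48 -0.96]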

Choosing a Vocabulary

Unknown Words. Necessity for UNK words: we won't have all the words in the world in the training data, and larger vocabularies require more memory and computation time. Common ways to limit the vocabulary: a frequency threshold (usually UNK for counts <= 1) or a rank threshold.
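A minimal sketch of a frequency-threshold vocabulary with UNK replacement (the threshold and toy corpus are illustrative):

from collections import Counter

def build_vocab(sentences, min_count=2):
    # Keep words appearing at least min_count times; everything else maps to <unk>.
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def unkify(sent, vocab):
    return [w if w in vocab else "<unk>" for w in sent]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(corpus)
print(unkify(["the", "aardvark", "sat"], vocab))   # ['the', '<unk>', 'sat']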

Evaluation and Vocabulary. Important: the vocabulary must be the same over the models you compare. Or, more accurately, all models must be able to generate the test set (it's OK if they can generate more than the test set, but not less). E.g. comparing a character-based model to a word-based model is fair, but not vice-versa.

Let's try it out! (loglin-lm.py)

What Problems are Handled? Cannot share strength among similar words ("she bought a car", "she purchased a car", "she bought a bicycle", "she purchased a bicycle") → not solved yet. Cannot condition on context with intervening words ("Dr. Jane Smith", "Dr. Gertrude Smith") → solved! Cannot handle long-distance dependencies ("for tennis class he wanted to buy his own racquet", "for programming class he wanted to buy his own computer") → not solved yet.

Beyond Linear Models

Linear Models can't Learn Feature Combinations: "farmers eat steak" → high, "farmers eat hay" → low, "cows eat steak" → low, "cows eat hay" → high. These can't be expressed by linear features. What can we do? Remember combinations as features (individual scores for "farmers eat", "cows eat") → feature space explosion! Or: neural nets.

Neural Language Models (see Bengio et al. 2004): look up the embeddings of the context words "giving" and "a", concatenate them into h, compute a hidden layer tanh(W1*h + b1), multiply by W and add the bias to get scores, then apply the softmax to get probs.
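A minimal numpy sketch of this feed-forward language model's forward pass (all sizes and parameter values are illustrative assumptions, and training is omitted):

import numpy as np

V, emb_size, hid_size = 1000, 64, 128           # vocabulary, embedding, hidden sizes

E = np.random.randn(V, emb_size) * 0.1          # input word embeddings
W1 = np.random.randn(hid_size, 2 * emb_size) * 0.1
b1 = np.zeros(hid_size)
W = np.random.randn(V, hid_size) * 0.1          # softmax matrix
b = np.zeros(V)                                 # softmax bias

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def next_word_probs(prev2_id, prev1_id):
    h = np.concatenate([E[prev2_id], E[prev1_id]])   # look up and concatenate the context words
    hidden = np.tanh(W1 @ h + b1)                    # hidden layer
    return softmax(W @ hidden + b)                   # scores -> probabilities

p = next_word_probs(3, 7)      # e.g. the ids of "giving" and "a"
print(p.shape, p.sum())        # (1000,) ~1.0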

Where is Strength Shared? Word embeddings: similar input words get similar vectors. Similar contexts get similar hidden states (tanh(W1*h + b1)). Similar output words get similar rows in the softmax matrix W.

What Problems are Handled? Cannot share strength among similar words ("she bought a car", "she purchased a car", "she bought a bicycle", "she purchased a bicycle") → solved, and similar contexts as well! Cannot condition on context with intervening words ("Dr. Jane Smith", "Dr. Gertrude Smith") → solved! Cannot handle long-distance dependencies ("for tennis class he wanted to buy his own racquet", "for programming class he wanted to buy his own computer") → not solved yet.

Let's Try it Out! (nn-lm.py)

Tying Input/Output Embeddings. We can share parameters between the input and output embeddings (Press et al. 2016, inter alia). Want to try? Delete the input embeddings, and instead pick a row from the softmax matrix for each context word.
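A minimal sketch of the tied variant of the model sketched above, reusing the softmax matrix W as the input embedding table (this assumes the embedding size equals the row size of W, an illustrative simplification):

import numpy as np

V, size = 1000, 64                        # shared embedding size

W = np.random.randn(V, size) * 0.1        # softmax matrix, also used for input lookup
b = np.zeros(V)
W1 = np.random.randn(size, 2 * size) * 0.1
b1 = np.zeros(size)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def next_word_probs_tied(prev2_id, prev1_id):
    h = np.concatenate([W[prev2_id], W[prev1_id]])   # "pick a row" from the softmax matrix
    hidden = np.tanh(W1 @ h + b1)
    return softmax(W @ hidden + b)

print(next_word_probs_tied(3, 7).sum())   # ~1.0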

Training Tricks

Shuffling the Training Data. Stochastic gradient methods update the parameters a little bit at a time. What if we have the sentence "I love this sentence so much!" at the end of the training data 50 times? To train correctly, we should randomly shuffle the order of the training data.
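A minimal sketch of shuffling once per pass over the data (train_on and the data are illustrative placeholders):

import random

train_data = [f"sentence {i}" for i in range(100)]   # placeholder training sentences

def train_on(example):
    pass   # placeholder: compute the loss, backprop, update the parameters

for epoch in range(5):
    random.shuffle(train_data)    # spread out repeated or clustered sentences
    for example in train_data:
        train_on(example)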

Other Optimization Options. SGD with momentum: remember gradients from past time steps to prevent sudden changes. Adagrad: adapt the learning rate, reducing it for frequently updated parameters (as measured by the variance of the gradient). Adam: like Adagrad, but keeps a running average of the momentum and gradient variance. Many others: RMSProp, Adadelta, etc. (see the Ruder 2016 reference for more details).
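As one example of these variants, a minimal numpy sketch of SGD with momentum (the momentum coefficient and learning rate are illustrative):

import numpy as np

def momentum_update(theta, grad, velocity, lr=0.1, momentum=0.9):
    # Keep a running "velocity" of past gradients to prevent sudden changes.
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity

theta = np.array([0.5, -1.0])
velocity = np.zeros_like(theta)
for grad in [np.array([0.2, -0.4]), np.array([0.1, -0.2])]:
    theta, velocity = momentum_update(theta, grad, velocity)
print(theta)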

Early Stopping, Learning Rate Decay. Neural nets have tons of parameters: we want to prevent them from over-fitting. We can do this by monitoring our performance on held-out development data and stopping training when it starts to get worse. It also sometimes helps to reduce the learning rate and continue training.
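A minimal sketch of early stopping combined with learning rate decay (train_one_epoch, evaluate_dev, the decay factor, and the patience are illustrative placeholders):

import random

def train_one_epoch(lr):
    pass   # placeholder: one pass over the training data with learning rate lr

def evaluate_dev():
    return random.uniform(5.0, 10.0)   # placeholder: dev-set perplexity (lower is better)

lr, best_dev, patience = 0.1, float("inf"), 0
for epoch in range(20):
    train_one_epoch(lr)
    dev_score = evaluate_dev()
    if dev_score < best_dev:
        best_dev, patience = dev_score, 0
    else:
        patience += 1
        lr *= 0.5           # dev performance got worse: decay the learning rate and continue
        if patience >= 3:   # it kept getting worse: stop early
            break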

Which One to Use? Adam is usually fast to converge and stable, but simple SGD tends to do very well in terms of generalization. You should use learning rate decay (see e.g. the machine translation results by Denkowski & Neubig 2017).

Dropout. Neural nets have lots of parameters, and are prone to overfitting. Dropout: randomly zero out nodes in the hidden layer with probability p, at training time only. Because the number of active nodes at training/test time is different, scaling is necessary: standard dropout scales by (1-p) at test time; inverted dropout scales by 1/(1-p) at training time.
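A minimal numpy sketch of inverted dropout (the dropout probability is an illustrative choice):

import numpy as np

def inverted_dropout(h, p=0.5, train=True):
    # Zero out each node with probability p at training time, scaling the rest by 1/(1-p).
    if not train:
        return h                  # no dropout and no scaling at test time
    mask = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.ones(8)
print(inverted_dropout(h, p=0.5))   # surviving entries are scaled to 2.0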

Let's Try it Out! (nn-lm-optim.py)

Efficiency Tricks: Operation Batching

Efficiency Tricks: Mini-batching. On modern hardware, 10 operations of size 1 are much slower than 1 operation of size 10. Minibatching combines smaller operations together into one big one.

Minibatching

Manual Mini-batching. Group together similar operations (e.g. loss calculations for a single word) and execute them all together. In the case of a feed-forward language model, each word prediction in a sentence can be batched; for recurrent neural nets, etc., it is more complicated. DyNet has special minibatch operations for lookup and loss functions; everything else is automatic.
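A minimal numpy sketch of the payoff: several size-1 operations versus one batched operation (shapes and values are illustrative):

import numpy as np

W = np.random.randn(128, 64)                      # a weight matrix applied to every input
xs = [np.random.randn(64) for _ in range(10)]     # 10 separate inputs (e.g. 10 word contexts)

# Unbatched: 10 matrix-vector products of size 1.
ys_loop = [W @ x for x in xs]

# Mini-batched: 1 matrix-matrix product covering all 10 inputs.
X = np.stack(xs, axis=1)     # shape (64, 10)
Y = W @ X                    # shape (128, 10)

assert np.allclose(np.stack(ys_loop, axis=1), Y)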

Mini-batched Code Example

Let's Try it Out! (nn-lm-batch.py)

Automatic Mini-batching! (see Neubig et al. 2017) Try it with the dynet-autobatch command line option

Autobatching Usage:
for each minibatch:
    for each data point in the mini-batch:
        define/add data
    sum losses
    forward (autobatch engine does magic!)
    backward
    update

Speed Improvements

Questions?