Learning Meanings for Sentences with Recursive Autoencoders

Tsung-Yi Lin and Chen-Yu Lee
Department of Electrical and Computer Engineering
University of California, San Diego
{tsl008, chl260}@ucsd.edu

March 18, 2013

Abstract

The objective of this project is to implement the recursive autoencoder (RAE) method to learn a model that predicts the sentiment of sentences, and to reproduce the results in [1]. To learn the weights of the recursive functions, we implement forward and backward propagation algorithms. We validate the gradient computed by forward and backward propagation by comparing it to a gradient computed by numerical approximation. Our implementation achieves 48.5% accuracy with the basic RAE, and 76.0% accuracy when we use a normalized transfer function and adjustable meaning vectors.

1 Introduction

Learning meanings for sentences is an important application. We leverage the recursive autoencoder (RAE) to learn a model that predicts the meaning of a sentence, while simultaneously learning the structure of sentences and the meaning vectors of words with a semi-supervised algorithm. In this project, we implement the recursively structured function for this task and reproduce the results in [1]. We use forward propagation with a greedy algorithm to predict the tree structure of a sentence, and then use backward propagation to compute all needed gradient values efficiently and update the weights of our recursively structured model. To validate the correctness of the gradient computation, we compare our gradient values with numerical differentiation; the differences are below 10^{-9}.

In the experiment section, we implement the RAE with four different settings. We first implement the basic RAE tree using an un-normalized transfer function and fixed meaning vectors for the input words, and we then produce the three remaining variations by adding different combinations of these two options to investigate the impact of each. The experimental results suggest that using normalized features and adjustable meaning vectors achieves the highest accuracy for sentiment prediction. Empirically, we demonstrate the ability of the RAE model to predict the sentiment of sentences and words by showing the ten best and worst movie-review sentences and words. We find that the RAE can predict sentiments for whole sentences but fails at word-level sentiments, since we do not assign sentiments to individual words during training. We also show that the learned meaning vectors can be used to retrieve sentences with similar meanings, and that our model can predict reasonable tree structures for sentences.

2 Algorithm

2.1 Unsupervised Autoencoders

The goal of using autoencoders is to do unsupervised learning through the reconstruction error between two children nodes and their two reconstructed counterparts. Let node k have two children i and j, whose meaning vectors c_i and c_j have length d. We can encode the meanings of the two children to generate a meaning vector for node k:

c_k = h(W_1 [c_i; c_j] + b_1)    (1)

where W_1 \in \mathbb{R}^{d \times 2d} and b_1 \in \mathbb{R}^d are weights we need to learn. We use the point-wise \tanh(x) function for h to map each element from \mathbb{R} to [-1, 1]. We can then try to reconstruct the meanings of children i and j from c_k:

[c_i'; c_j'] = W_2 c_k + b_2    (2)

where W_2 \in \mathbb{R}^{2d \times d} and b_2 \in \mathbb{R}^{2d}. We then use a squared loss to compute the reconstruction error E_1:

E_1 = \frac{n_i}{n_i + n_j} \|c_i - c_i'\|^2 + \frac{n_j}{n_i + n_j} \|c_j - c_j'\|^2    (3)

where n_i and n_j are the numbers of nodes covered by i and j. Note that we do not need any label information to compute E_1, so this is unsupervised autoencoder learning.

2.2 Forward Propagation

To get a single meaning vector for a sentence, we need to know the tree structure over its n leaf nodes and encode pairs of children until we reach a root node. We use a greedy algorithm to find the tree structure: first, we compute the n - 1 reconstruction errors of all adjacent pairs of nodes and merge the pair with the smallest error; we then compute the n - 2 reconstruction errors among the remaining nodes and the new node, and repeat this procedure until a single root node is created. The reconstruction error for the whole sentence is

\sum_{k \in T} E_1(k)    (4)

where T is the set of non-leaf nodes of the tree. The greedy algorithm finds a tree structure much faster than exhaustive enumeration, but the result is not always optimal.

We introduce a target value t for each sentence. In this work we only have binary labels, so we model the sentiment p of each sentence as a binomial distribution over two states, positive and negative. We then compute the sentiment error as the log loss between the target value and the sentiment output:

E_2(k) = -\sum_{i=1}^{2} t_i \log p_i    (5)

where the sentiment output p is computed from the meaning vector c_k with a logistic regression (softmax) classifier:

p = \mathrm{softmax}(W_{label} c_k)    (6)

where W_{label} \in \mathbb{R}^{2 \times d} is a classifier weight matrix that needs to be learned; this classifier is what makes the autoencoder semi-supervised. We apply the sentiment error to all non-leaf nodes.
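To make the algorithm above concrete, the following NumPy sketch implements the encoder of Eq. (1), the weighted reconstruction error of Eqs. (2)-(3), the softmax sentiment output of Eq. (6), and the greedy adjacent-pair merging of Section 2.2. It is an illustration using our own variable names and the basic un-normalized transfer function, not the code behind the reported numbers.

```python
import numpy as np

def encode(c_i, c_j, W1, b1):
    """Eq. (1): parent meaning vector from two children (W1 is d x 2d, b1 is d)."""
    return np.tanh(W1 @ np.concatenate([c_i, c_j]) + b1)

def reconstruction_error(c_i, c_j, n_i, n_j, W1, b1, W2, b2):
    """Eqs. (2)-(3): decode the parent and compute the weighted squared loss
    (W2 is 2d x d, b2 is 2d)."""
    c_k = encode(c_i, c_j, W1, b1)
    rec = W2 @ c_k + b2                         # [c_i'; c_j'], Eq. (2)
    c_i_rec, c_j_rec = rec[:len(c_i)], rec[len(c_i):]
    w_i, w_j = n_i / (n_i + n_j), n_j / (n_i + n_j)
    e1 = w_i * np.sum((c_i - c_i_rec) ** 2) + w_j * np.sum((c_j - c_j_rec) ** 2)
    return e1, c_k

def sentiment(c_k, W_label):
    """Eq. (6): softmax over the two sentiment classes (W_label is 2 x d)."""
    z = W_label @ c_k
    z = z - z.max()                             # for numerical stability
    p = np.exp(z)
    return p / p.sum()

def greedy_tree(word_vectors, W1, b1, W2, b2):
    """Section 2.2: repeatedly merge the adjacent pair of nodes with the
    smallest reconstruction error until a single root node remains."""
    nodes = [(c, 1) for c in word_vectors]      # (meaning vector, #leaves covered)
    total_e1, merges = 0.0, []
    while len(nodes) > 1:
        candidates = []
        for idx in range(len(nodes) - 1):
            (c_i, n_i), (c_j, n_j) = nodes[idx], nodes[idx + 1]
            e1, c_k = reconstruction_error(c_i, c_j, n_i, n_j, W1, b1, W2, b2)
            candidates.append((e1, idx, c_k, n_i + n_j))
        e1, idx, c_k, n_k = min(candidates, key=lambda t: t[0])
        total_e1 += e1                          # Eq. (4): sum over non-leaf nodes
        merges.append(idx)
        nodes[idx:idx + 2] = [(c_k, n_k)]
    root_vector = nodes[0][0]
    return root_vector, total_e1, merges
```

A full training pass would additionally evaluate the sentiment loss E_2 at every merge produced by greedy_tree and backpropagate through the resulting tree, as described next in Section 2.3.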

2.3 Learning by Backward Propagation

We can now combine the reconstruction error and the sentiment error to form an objective function over the N labeled training sentences:

J = \frac{1}{N} \sum_{<s,t> \in S} E(s, t, \theta) + \frac{\lambda}{2} \|\theta\|^2    (7)

where \theta = <W_1, b_1, W_2, b_2, W_{label}> collects all parameters that need to be learned, \lambda is the strength of the regularization term, and E(s, t, \theta) is the total loss of one training sentence s with target value t:

E(s, t, \theta) = \sum_{k \in T} \big[ \alpha E_1(k) + (1 - \alpha) E_2(k) \big]    (8)

where T is the set of non-leaf nodes and \alpha is a hyperparameter that weights the reconstruction error against the sentiment error. We learn the parameters \theta by gradient descent on the objective function J:

\theta := \theta - \eta \frac{dJ}{d\theta}    (9)

where \eta is the learning rate. We use the backward propagation algorithm to compute the gradient efficiently. The element W_{ij} is the scalar in the i-th row and j-th column of a weight matrix W, and it affects the total loss J only through the i-th component a_i of the activation a. Therefore,

\frac{\partial J}{\partial W_{ij}} = \frac{\partial J}{\partial a_i} \frac{\partial a_i}{\partial W_{ij}} = \delta_i \frac{\partial a_i}{\partial W_{ij}}    (10)

Since a = [W, b][c_i; c_j; 1] and r = h(a), we have \partial a_i / \partial W_{ij} = ([c_i; c_j; 1])_j, and therefore

\frac{\partial J}{\partial W} = \delta \, [c_i; c_j; 1]^T    (11)

To compute \delta_i, we need to consider two cases. First, for an output node with output vector r and loss function L,

\delta_i = \frac{\partial J}{\partial a_i} = \frac{\partial L}{\partial a_i} = \frac{\partial L}{\partial r_i} \frac{\partial r_i}{\partial a_i} = \frac{\partial L}{\partial r_i} h'(a_i)    (12)

Second, for an internal node, consider how its activation a feeds into its parent nodes, indexed by k:

\delta_i = \frac{\partial J}{\partial a_i} = \sum_k \sum_{p=1}^{d} \frac{\partial J}{\partial a_p^k} \frac{\partial a_p^k}{\partial a_i} = \sum_k \sum_{p=1}^{d} \delta_p^k \frac{\partial a_p^k}{\partial a_i}    (13)

where \delta_p^k comes from the parent node; this is where backpropagation happens. Note that a^k = W[\ldots; h(a); \ldots; 1], so

\frac{\partial a_p^k}{\partial a_i} = h'(a_i) V_{pi}^k    (14)

where V^k is the portion of the weight matrix W that multiplies h(a). Therefore, the whole delta vector of an internal node is

\delta = h'(a) \odot \Big[ \sum_k (\delta^k)^T V^k \Big]^T    (15)

where \odot denotes point-wise multiplication.

2.4 Numerical Differentiation

We perform a sanity check by comparing the gradient values computed by backpropagation with those computed by numerical differentiation. The numerical value is computed by the central difference

\frac{\partial J}{\partial W_{ij}} = \frac{J(w_{ij} + \epsilon) - J(w_{ij} - \epsilon)}{2\epsilon} + O(\epsilon^2)    (16)

The time complexity of numerical differentiation is O(d^4): W has d^2 elements, and for each element we need to recompute J in O(d^2) time, so the total cost is O(d^2 \cdot d^2) = O(d^4). Backpropagation, in contrast, needs only O(d^2) per node, because it requires only point-wise multiplications over the d^2 elements of W. To traverse and compare all n nodes in a tree, the complexity is O(nd^2 + n^2); in our case d^2 >> n, so it can be approximated by O(nd^2).
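As a minimal sketch of this sanity check, the snippet below compares an analytic gradient with the central-difference approximation of Eq. (16). The central-difference routine is generic; the toy objective J(W) = 0.5 * ||tanh(Wx) - y||^2 and its closed-form gradient are our own stand-in, used only to keep the example self-contained, and the same routine can be pointed at the full objective of Eq. (7).

```python
import numpy as np

def numerical_gradient(J, W, eps=1e-6):
    """Central difference of Eq. (16): [J(W_ij + eps) - J(W_ij - eps)] / (2 eps)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        j_plus = J(W)
        W[idx] = old - eps
        j_minus = J(W)
        W[idx] = old                            # restore the original entry
        grad[idx] = (j_plus - j_minus) / (2.0 * eps)
    return grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 5
    W = rng.standard_normal((d, d))
    x = rng.standard_normal(d)
    y = rng.standard_normal(d)

    # Toy objective (our own example, not the RAE objective itself).
    J = lambda W: 0.5 * np.sum((np.tanh(W @ x) - y) ** 2)

    # Analytic gradient by backpropagation:
    # delta = (tanh(Wx) - y) * (1 - tanh(Wx)^2), dJ/dW = outer(delta, x).
    a = W @ x
    delta = (np.tanh(a) - y) * (1.0 - np.tanh(a) ** 2)
    analytic = np.outer(delta, x)

    diff = np.max(np.abs(analytic - numerical_gradient(J, W)))
    print(f"max |analytic - numerical| = {diff:.2e}")  # typically ~1e-9 or below
```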

3 Experimental Results

In this section, we explain our experimental setup, investigate the trained model, and evaluate the testing accuracies under different parameter settings.

3.1 Experiment Setup

We use the movie-review dataset (http://www.cs.cornell.edu/people/pabo/movie-review-data/) mentioned in [1]. There are 10662 sentences in total and 14043 words in the dictionary. We randomly select 2000 sentences for the training set and 200 sentences for the testing set, and use meaning-vector length d = 20 and \alpha = 0.05 in our experiments.

3.2 Investigation of the Trained Model

In this section, we analyze our model in four different ways. First, we show the ten most positive and the ten most negative words in Table 1. We cannot observe a strong relationship between words and sentiments. This is probably because we do not apply the sentiment error to leaf nodes, so the learning process never tries to minimize the sentiment error for individual words; as a result, the input word vectors cannot be mapped to the target class by the learned weight matrix.

Positive words: what; current; group; Romano; But; take; received; officer; fire; difference
Negative words: last; crashed; union; than; being; barrel; bounce; imprisoned; to; of

Table 1: Ten most positive and ten most negative words.

Second, we show the ten phrases predicted to be most positive and most negative in our testing set in Table 2 and Table 3. We can clearly see that the trained model captures the correct sentiments of these sentences. This is because our model minimizes the sentiment error at the sentence level rather than for individual words.

1. a tremendous piece of work.
2. slight but enjoyable documentary.
3. a moving and important film.
4. touch!
5. *UNKNOWN* professional but finally slight.
6. psychologically revealing.
7. *UNKNOWN* mean-spirited and wryly observant.
8. a well-executed spy-thriller.
9. sexy and romantic.
10. exciting documentary.

Table 2: Ten most positive phrases.

1. i hate this movie
2. silly, loud and goofy.
3. should have been someone else
4. the action *UNKNOWN* s just pile up.
5. the script is too mainstream and the psychology too textbook to intrigue.
6. aggravating and tedious.
7. ... overly melodramatic ...
8. ... silly *UNKNOWN* ...
9. ... unbearably lame.
10. without the *UNKNOWN* satiric humor.

Table 3: Ten most negative phrases.

Third, we choose one interesting sentence,

"one big blustery movie where nothing really happens. when it comes out on video, then it's the perfect cure for insomnia."

and find its most similar sentence by selecting the smallest L2 distance between meaning vectors among all other sentences:

"it won't be long before you'll spy at a video store near you."

It is interesting that the model produces similar meaning vectors for sentences with similar meanings; the RAE model appears to group words and phrases with similar meanings together in the meaning space.

Finally, we pick an interesting sentence and show the tree structure that the greedy algorithm finds for it in Figure 1. The greedy algorithm finds reasonable parses for sentences in our implementation.

[Figure 1: Tree structure found by the greedy algorithm for the phrase "... a roller-coaster ride of a movie" (a tree of height 4).]
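As a small illustration of the retrieval used in the third analysis above, the sketch below assumes the root meaning vectors of all sentences (e.g., from greedy_tree in the earlier sketch) are stacked in an (N, d) array; the function name and array layout are our own choices.

```python
import numpy as np

def most_similar(query_idx, meaning_vectors):
    """Return the index of the sentence whose root meaning vector has the
    smallest L2 distance to that of sentence `query_idx`."""
    q = meaning_vectors[query_idx]
    dists = np.linalg.norm(meaning_vectors - q, axis=1)
    dists[query_idx] = np.inf                   # exclude the query itself
    return int(np.argmin(dists))
```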

3.3 Accuracy Evaluation

We investigate whether it is necessary to adjust the vector representing each word and to normalize the meaning vectors of words and phrases. Table 4 shows four different experimental settings, all using the same training and testing sets and the same parameter initialization. Setting 1, which uses both transfer-function normalization and meaning-vector learning, achieves the highest training and testing accuracy. We see no large difference between Settings 2, 3, and 4, probably because we only use meaning vectors of length 20, and the gradient descent algorithm can therefore be trapped in a local minimum.

Setting 1: training accuracy 89.0%, testing accuracy 76.0%
Setting 2: training accuracy 52.8%, testing accuracy 48.5%
Setting 3: training accuracy 50.9%, testing accuracy 51.0%
Setting 4: training accuracy 50.5%, testing accuracy 48.5%

Table 4: Training and testing accuracies under the four settings. Setting 1 enables both transfer-function normalization and meaning-vector learning; Settings 2-4 use the remaining combinations of these two options.

4 Discussion

In this report, we apply the RAE to sentiment prediction. We implement efficient forward and backward propagation algorithms for gradient computation, with complexity O(nd^2). The computed gradients are validated by comparing them to gradients obtained from numerical approximation. We control two factors, (1) transfer-function normalization and (2) word meaning-vector adjustment, to investigate their effects on training and inference. We find that adding transfer-function normalization and meaning-vector adjustment to the RAE algorithm makes it more robust against poor local minima and speeds up convergence.

References

[1] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011.