CS 294-1: Assignment 2
A Large-Scale Linear Regression Sentiment Model


Shaunak Chatterjee
Computer Science Division
University of California
Berkeley, CA 94720
shaunakc@cs.berkeley.edu

Abstract

The primary objective of this assignment was to build a linear regression sentiment model based on amazon.com reviews. The main challenge was handling moderately large amounts of data on a single machine. The variations I tried include the following: the exact solution (L2 loss with ridge regularization), stochastic gradient descent with different training schemes and initializations, lasso regularization, and both unigram and bigram features.

1 Introduction

Linear regression is a very popular approach to modeling the relationship between a response variable y and one or more explanatory variables x = {x_1, x_2, ..., x_p} using a linear model:

\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} x_j \hat{\beta}_j

Here \hat{\beta}_0 is the intercept or bias of the model. It can also be handled by adding a constant x_0 = 1 to every x, in which case the formula simplifies to:

\hat{y} = \sum_{j=0}^{p} x_j \hat{\beta}_j = x^T \hat{\beta}

There are several variants of the linear regression model. The loss function that we wish to minimize (by learning an appropriate \hat{\beta}) can be chosen based on the application. Popular choices are the L1 and L2 norms of the difference between y (the true value) and \hat{y} (the predicted value). For the L2 loss there exists an exact solution:

\hat{\beta} = (X^T X)^{-1} X^T y

In practice this system can be singular, especially if the features are dependent: the problem then has a linear space of solutions and the matrix X^T X is not invertible. We can avoid this problem (with very high probability) by additionally minimizing the norm of \hat{\beta}. If we minimize the L2 norm of \hat{\beta}, the method is called ridge regression and the solution is:

\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y

We can instead penalize the L1 norm of \hat{\beta}, which is called lasso regularization. This has no closed-form solution, but it can be solved by stochastic gradient methods.

The rest of this report is structured as follows. Section 2 describes the data pre-processing issues. Sections 3 and 4 provide a detailed analysis of classifier performance with unigrams and bigrams respectively, and we conclude in Section 5.
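
For later reference, the stochastic gradient experiments in Section 3.2 use the gradients of these regularized objectives without restating them. The expressions below are standard results (under the common 1/2 scaling convention), not formulas quoted from the report:

\nabla_{\beta}\left[\tfrac{1}{2}\|y - X\beta\|_2^2 + \tfrac{\lambda}{2}\|\beta\|_2^2\right] = X^T(X\beta - y) + \lambda\beta \qquad \text{(ridge)}

\partial_{\beta}\left[\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1\right] \ni X^T(X\beta - y) + \lambda\,\mathrm{sign}(\beta) \qquad \text{(lasso subgradient)}

In the stochastic setting, X and y are restricted to the current block of reviews.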

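
The closed-form ridge fit described in Section 1 is compact in MATLAB. The sketch below is illustrative only and is not the assignment code: X, y and lambda are placeholder names, and the intercept is handled by appending a constant column, as in the second formula of Section 1.

    % Minimal ridge regression sketch (illustrative, not the assignment code).
    % X: n-by-p feature matrix (sparse or dense), y: n-by-1 response vector.
    lambda = 100;                          % regularization strength
    Xb = [ones(size(X, 1), 1), X];         % prepend a constant column for the intercept
    p1 = size(Xb, 2);
    betaHat = (Xb' * Xb + lambda * speye(p1)) \ (Xb' * y);   % ridge solution
    yHat = Xb * betaHat;                   % predictions
    rmse = sqrt(mean((y - yHat).^2));      % root mean squared error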

2 Dataset and its pre-processing

The task in this assignment was to build a linear regression sentiment model based on book reviews from www.amazon.com. The reviews (about 975,000 including duplicates) were collected by Mark Dredze et al. at Johns Hopkins (available from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/). I used a binary file, tokens.bin, containing the XML tree representation of the reviews. I performed all my experiments for this assignment in MATLAB.

2.1 Parsing reviews from tokens.bin

The XML tree for each review contained all the relevant information about it. The <rating> field contained a numerical score (1, 2, 4, or 5); the reviews with a rating of 3 were removed, as they were deemed to be neutral (and hence uninformative!). The <review> token was used to identify where the file information changed from one review to the next. The <reviewer>, <title>, <review_text>, <date> and <helpful> fields all contained information possibly relevant to the sentiment model. However, in this assignment I have only used the <review_text> field. This choice was based on what I felt most closely resembles the real world: we do not generally have access to any information other than the raw text.

I started off by reading from the binary stream entry by entry. This was painfully slow: it would have required more than a day to go through all the reviews' XML trees. Instead, when I read the binary input stream in blocks of 100,000 entries, the process finished in less than half an hour! That was an important lesson learnt. The time taken was not very sensitive to the block size that I used.

2.2 Duplicate review removal

Another artifact of this dataset was that it contained a number of duplicate reviews. I used a hash of the first 500 words of each review (or of the entire review if it was shorter) to eliminate duplicates. MATLAB does not have an inbuilt HashSet, so I implemented one myself. My hash function was not very strong, so there were a few false duplicate detections. The number of unique reviews I am using is 515,516.

For my experiments, I divided the dataset into 10 partitions and used a 9:1 train:test ratio. All review numbers (in the unique-reviews list) ending in the same digit are considered to be one partition.
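
To illustrate the de-duplication and partitioning steps, here is a hedged sketch. It is not the report's implementation: the author used a custom, lossy hash set, whereas this version keys a containers.Map on the exact first 500 token ids (so it produces no false duplicates), and reviewTokens is an assumed cell array holding one token-id vector per review.

    % Sketch of duplicate removal and train/test partitioning (illustrative).
    % reviewTokens is assumed: a cell array with one vector of token ids per review.
    seen = containers.Map('KeyType', 'char', 'ValueType', 'logical');
    keep = false(numel(reviewTokens), 1);
    for i = 1:numel(reviewTokens)
        t = reviewTokens{i};
        key = sprintf('%d,', t(1:min(500, numel(t))));   % first 500 tokens as the key
        if ~isKey(seen, key)
            seen(key) = true;
            keep(i) = true;                              % first occurrence: keep it
        end
    end
    uniqueIdx = find(keep);
    % 10 partitions by the last digit of the position in the unique-review list;
    % one partition is held out as the test set (9:1 train:test split).
    partition = mod((1:numel(uniqueIdx))', 10);
    testMask  = (partition == 0);
    trainMask = ~testMask;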

3 Experiments with Unigrams

In this section, I describe the different classifiers I implemented with unigram tokens.

3.1 Ridge regression with L2 loss

The first classifier I implemented was the one with an L2 loss function and ridge regularization. For this classifier we can obtain the solution in closed form (as described in Section 1). Let us analyze each step in this algorithm:

Computing X^T X: X is a sparse feature matrix, so this matrix multiplication is quite fast. For dim(X) = 15000 × 460000, this step took between 30 and 40 seconds.

Computing X^T y: much faster than the previous step.

Inverting X^T X: this p × p matrix is dense, so inverting it takes a lot of time (a few minutes in my case). This step is the main computational bottleneck. The largest matrix I could invert was for p = 20000, before running out of memory with 12GB of RAM.

The parameters of this classifier are the number of features used (p) and the regularization parameter λ. I used the p most frequent tokens as my features, for p = 5000, 15000 and 20000. As expected, the test RMSE decreased as the number of features increased. I also varied the regularization parameter λ over the values 1, 10, 100 and 1000. Increasing the regularization parameter seemed to have a marginal positive impact on performance, which suggests that the dataset favors heavier regularization (the limiting case λ = ∞ being an intercept-only model). The results are shown in Table 1.

#Features   λ = 1    λ = 10   λ = 100   λ = 1000
5000        0.9669   0.9668   0.9663    0.9657
15000       0.9561   0.9553   0.9513    0.9515
20000       0.9590   0.9573   0.9496    0.9498

Table 1: Root Mean Squared Error (RMSE) on test data for exact ridge regression (L2 loss) with unigrams.

The most influential positive and negative words from this exact method are listed in Table 2 under the "Exact" columns (these results are for p = 15000, λ = 100).

              Positive                    Negative
Exact          St. Gr.       Exact           St. Gr.
refreshingly   best          disappointing   i
funniest       life          unreadable      book
patron         most          waste           but
pleased        history       drivel          was
invaluable     love          poorly          not
excellent      human         useless         just
bravo          knowledge     tripe           like
awesome        years         laughable       dont
punches        you           worst           no
donovan        us            worthless       author

Table 2: Highest weighted positive and negative unigrams.
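
Tables like Table 2 can be read off a fitted weight vector directly. A minimal sketch, assuming a weight vector betaHat (without the intercept entry) and a matching cell array vocab of token strings:

    % Sketch: list the highest-weighted positive and negative unigrams.
    % betaHat: p-by-1 weight vector, vocab: p-by-1 cell array of token strings.
    [~, order] = sort(betaHat, 'descend');
    topPositive = vocab(order(1:10));         % most positive words
    topNegative = vocab(order(end-9:end));    % most negative words
    disp(table(topPositive, flipud(topNegative), ...
         'VariableNames', {'Positive', 'Negative'}));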

3.2 Stochastic gradient descent

The next thing I implemented was the stochastic gradient method. My first objective was to include more features in the model and see how that affected the error. All experiments in this section were run with p = 500000 features (the p most frequently occurring tokens). Initially, I stuck to the L2 loss and ridge regularization.

3.2.1 Armijo rule

The initial runs of the stochastic gradient would either just hover around the initial RMSE (without reducing it) or simply explode! Starting with a large step size and gradually reducing it with the number of iterations also did not help (I tried quite a few step-size schedules). So I finally settled on the adaptive Armijo rule to decide the step size. The Armijo rule is essentially a line search over m ∈ {1, 2, ...} for a step size α^m such that f(x + α^m Δx) < c f(x), where Δx is the gradient-descent update direction, c < 1 is the improvement you wish to achieve in each iteration, and α is the base step size. I picked the best value of m ∈ {1, ..., 10} at every step. I also experimented with different values of α: for α larger than 0.5 the process would diverge, while for α less than 0.3 it would almost always converge.

3.2.2 Training block samples

The next issue was choosing a block size. In each iteration of stochastic gradient descent, we update β̂ based on a block of reviews. If the dataset were uniform, how we choose a block would be immaterial; in this dataset, however, the distribution was not uniform. I tried two different block selection schemes. First, a sequential scanning scheme, where each update was based on a sequential chunk of 1000 reviews from the training set. The alternative was random sampling: selecting a random set of reviews from the entire training set in every iteration.

Figure 1: RMSE convergence with iterations for the different training schemes.

The convergence rates for the two schemes are shown in Figure 1. Random sampling clearly works better, which is also reflected in a better AUC score and higher lift scores (see Table 3). Unfortunately, the stochastic gradient descent method could not learn very effective classifiers, at least not as effective as the exact model with a much smaller number of features. This is reflected in the better AUC and lift scores of the exact method. A look at the most influential words also reveals that the gradient descent method focuses on very non-intuitive or vague words (Table 2).

3.2.3 Lift Scores

The lift score is essentially a measure of how much better a classifier is than a random classifier. We report the 1% lift scores. It is interesting to note that the lift scores for the positive class (i.e., positive reviews) are consistently better than those for the negative class (see Table 3). Our dataset contains many more positive reviews than negative ones. First, this results in a more confident positive classifier. Second, the test set also has more positive reviews, which in turn results in a relatively high false positive rate for the negative classifier at any given true positive rate.

Figure 2: Receiver operating characteristic (ROC) curves for the different classifiers. The exact model with a smaller number of features dominates.

3.2.4 Better initialization

The performance of the classifiers learnt by stochastic gradient descent was not as good as that of the exact model learnt previously with fewer features. The convergence patterns also suggested that a better initialization of β̂ might help. An obvious initialization point was the β̂ learnt by the exact method. Although this covers only a small portion of the new, much larger weight vector, it could prove useful. The experiments supported this intuition: the convergence rates were much better, as were the AUC and lift scores (see Table 3, Figure 1, Figure 2).
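
To make the pieces of Sections 3.2.1 and 3.2.2 concrete, here is a hedged sketch of block stochastic gradient descent with random block sampling and a geometric (Armijo-style) step search. It is illustrative only: the block size, step-size constants and acceptance test follow the description above, but Xtrain, ytrain, the iteration count and the scaling of the regularizer are assumptions, not the report's code.

    % Sketch: block SGD with ridge regularization and an Armijo-style step search.
    % Xtrain: n-by-p sparse feature matrix, ytrain: n-by-1 ratings.
    lambda    = 100;       % regularization strength
    alphaBase = 0.3;       % base step size (values above ~0.5 diverged in the report)
    c         = 0.99;      % required per-iteration improvement factor (c < 1)
    blockSize = 1000;
    nIters    = 2000;
    n = size(Xtrain, 1);
    beta = zeros(size(Xtrain, 2), 1);
    obj = @(b, X, y) 0.5 * mean((y - X * b).^2) + 0.5 * lambda * (b' * b) / n;
    for it = 1:nIters
        idx = randi(n, blockSize, 1);                    % random block sampling
        Xb = Xtrain(idx, :);  yb = ytrain(idx);
        g = Xb' * (Xb * beta - yb) / blockSize + lambda * beta / n;   % ridge gradient
        f0 = obj(beta, Xb, yb);
        for m = 1:10                                     % try steps alphaBase^1 ... alphaBase^10
            betaTry = beta - alphaBase^m * g;
            if obj(betaTry, Xb, yb) < c * f0             % Armijo-style acceptance test
                beta = betaTry;
                break;
            end
        end
    end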

3.3 Lasso regularization

The final thing I tried with unigrams was to change the regularization term from L2 (ridge) to L1 (lasso). This was convenient from an implementation standpoint, since it amounted to a single-line change in the stochastic gradient descent code. The convergence of the ridge and lasso regularization methods was almost identical, as seen in Figure 3.

Figure 3: Gradient descent convergence for ridge and lasso regularization.
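
In terms of the block-SGD sketch given above (after Section 3.2.4), that single-line change amounts to swapping the regularization term in the gradient; the line below is illustrative and mirrors that sketch rather than the report's code:

    % Ridge gradient term (as in the sketch above):
    %   g = Xb' * (Xb * beta - yb) / blockSize + lambda * beta / n;
    % Lasso version: replace the L2 penalty gradient with the L1 subgradient.
    g = Xb' * (Xb * beta - yb) / blockSize + lambda * sign(beta) / n;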


Method               AUC      Class   1% lift score
Sequential Scan      0.8745   -ve     11.22
                              +ve     26.74
Random Scan          0.9047   -ve     18.90
                              +ve     27.54
Ridge - Better init  0.9111   -ve     19.87
                              +ve     32.27
Lasso - Better init  0.9103   -ve     20.92
                              +ve     29.65
Matrix inversion     0.9290   -ve     30.11
                              +ve     36.87

Table 3: AUC and lift scores of various models with unigrams.

4 Experiments with bigrams

The final thing I tried was to run the exact method with bigram tokens instead of unigrams. Since bigrams capture more context than their constituent unigrams do individually, this is a natural extension.

4.1 Constructing bigram tokens

As mentioned before, MATLAB does not have a HashSet implementation. In order to assign token numbers to the bigrams, I had to create a hash function for any possible bigram. A unique hash is ensured by the following mapping:

<token_1, token_2> → B · token_1 + token_2

where B is the vocabulary size. However, if we place this hash value into K bins with a modulo-K operation, the results will be disastrous, since all bigrams ending with the same word end up in the same bin, resulting in an enormous number of collisions. Instead, we can choose the bin number by the following mapping:

#bin(<token_1, token_2>) = mod(token_1 · token_2, K)

This ensures a much more even distribution of the bigrams in terms of their number of occurrences (a small sketch of this scheme is given at the end of Section 4). To reduce the number of possible bigrams, I only considered bigrams where both tokens were among the 100,000 most frequent unigram tokens. Once the hash is set up, bigrams are given token numbers in decreasing order of frequency (similar to the unigram numbering scheme).

4.2 Results

I used the exact method with the 15,000 most frequent bigrams as features. Varying λ did not affect the results much (all reported results are for λ = 100). The AUC score was 0.9017. This was initially a surprise, since it is smaller than the AUC of the unigram model. However, an inspection of the most influential positive and negative bigrams (see Table 4) indicates what went wrong: the top-scoring unigrams (which were also very intuitive) did not make it into the list of most frequent bigrams, and hence were not available as features.

Positive            Negative
enlightening and    however in
the funniest        job with
together they       from page
a moral             works on
not afraid          i became

Table 4: Highest weighted positive and negative bigrams.

4.3 Measuring flops

MATLAB deprecated the flops function when it adopted LAPACK a few years ago. I looked online for other ways of measuring flop counts and came across a library by Tom Minka (http://research.microsoft.com/en-us/um/people/minka/software/lightspeed/). The general-purpose flops method does not quite work as advertised. The operation-specific methods (I checked flops_inv) do seem to work, but it did not make sense to quote flops for single operations, and I could not figure out a meaningful way of combining the individual methods into a composite flops value (I would be curious to know if someone else did).
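
Returning to the bigram construction of Section 4.1, the two mappings are simple to express. The sketch below is illustrative: the bin count K and the vectors tok1, tok2 of unigram token ids are assumptions, not values from the report.

    % Sketch of the bigram keying and binning scheme from Section 4.1 (illustrative).
    % tok1, tok2: vectors of unigram token ids (both restricted to the
    % 100,000 most frequent unigrams) for each candidate bigram.
    B = 100000;                      % vocabulary size used for bigram keys
    K = 2^22;                        % number of hash bins (assumed value)
    uniqueKey = B .* tok1 + tok2;    % unique key: B*token1 + token2
    % The report notes that binning uniqueKey mod K clusters bigrams by their
    % second token; binning on the product spreads bigrams far more evenly.
    bin = mod(tok1 .* tok2, K);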

5 Conclusion

Unfortunately, I underestimated the time it would take me to create the bigram tokens, and hence ran out of time to run a stochastic gradient optimization with a larger number of bigram tokens (which would have included the more expected bigrams). There is a small chance that there is an indexing issue somewhere in my bigram creation pipeline, but it has passed all the sanity checks I put it through.

The biggest takeaway from this assignment for me was learning to deal with moderately large amounts of data with a reasonable amount of computational power (a single machine). The implementation optimizations, including hash tables, sparse vectors and matrices, and block updates of these sparse structures, are important lessons for the future.

Wishlist

I did not have access to the matfile command in my version of MATLAB. That command could possibly have eased the initial data-handling phase of this assignment.

Acknowledgements

The author would like to thank Aastha Jain for generously lending her powerful workstation. The author also acknowledges Mobin Javed and Anupam Prakash for several interesting discussions.