CS 294-1: Assignment 2
A Large-Scale Linear Regression Sentiment Model


Shaunak Chatterjee
Computer Science Division
University of California
Berkeley, CA 94720
shaunakc@cs.berkeley.edu

Abstract

The primary objective of this assignment was to build a linear regression sentiment model based on amazon.com reviews. The main challenge was handling moderately large amounts of data on a single machine. The variations I tried include the following: the exact solution (L2 loss with ridge regularization), stochastic gradient descent with different training schemes and initializations, lasso regularization, and both unigram and bigram features.

1 Introduction

Linear regression is a very popular approach to modeling the relationship between a response variable y and one or more explanatory variables x = {x_1, x_2, ..., x_p} using a linear model:

\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} x_j \hat{\beta}_j

Here \hat{\beta}_0 is the intercept or bias of the model. It can also be handled by adding a constant x_0 = 1 to every x, in which case the formula simplifies to:

\hat{y} = \sum_{j=0}^{p} x_j \hat{\beta}_j = x^T \hat{\beta}

There are several variants of the linear regression model. The loss function that we wish to minimize (by learning an appropriate \hat{\beta}) can be chosen based on the application. Popular choices are the L1 and L2 norms of the difference between y (the true value) and \hat{y} (the predicted value). For the L2 loss there exists an exact solution:

\hat{\beta} = (X^T X)^{-1} X^T y

In practice this system can be singular, especially if the features are dependent: the problem then has a linear space of solutions and the matrix X^T X is not invertible. We can avoid this problem (with very high probability) by additionally minimizing the norm of \hat{\beta}. If we minimize the L2 norm of \hat{\beta}, the method is called ridge regression and the solution is:

\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y

We can instead penalize the L1 norm of \hat{\beta}, which is called lasso regularization. This has no closed-form solution, but it can be solved by stochastic gradient methods.

The rest of this report is structured as follows. Section 2 describes the data pre-processing issues. Sections 3 and 4 provide a detailed analysis of classifier performance with unigrams and bigrams respectively, and we conclude in Section 5.
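
For later reference, the stochastic gradient experiments in Section 3.2 use the gradients of these regularized objectives without restating them. The expressions below are standard results (under the common 1/2 scaling convention), not formulas quoted from the report:

\nabla_{\beta}\left[\tfrac{1}{2}\|y - X\beta\|_2^2 + \tfrac{\lambda}{2}\|\beta\|_2^2\right] = X^T(X\beta - y) + \lambda\beta \qquad \text{(ridge)}

\partial_{\beta}\left[\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1\right] \ni X^T(X\beta - y) + \lambda\,\mathrm{sign}(\beta) \qquad \text{(lasso subgradient)}

In the stochastic setting, X and y are restricted to the current block of reviews.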

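
The closed-form ridge fit described in Section 1 is compact in MATLAB. The sketch below is illustrative only and is not the assignment code: X, y and lambda are placeholder names, and the intercept is handled by appending a constant column, as in the second formula of Section 1.

    % Minimal ridge regression sketch (illustrative, not the assignment code).
    % X: n-by-p feature matrix (sparse or dense), y: n-by-1 response vector.
    lambda = 100;                          % regularization strength
    Xb = [ones(size(X, 1), 1), X];         % prepend a constant column for the intercept
    p1 = size(Xb, 2);
    betaHat = (Xb' * Xb + lambda * speye(p1)) \ (Xb' * y);   % ridge solution
    yHat = Xb * betaHat;                   % predictions
    rmse = sqrt(mean((y - yHat).^2));      % root mean squared error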

2 Dataset and its pre-processing

The task in this assignment was to build a linear regression sentiment model based on book reviews from www.amazon.com. The reviews (about 975,000 including duplicates) were collected by Mark Dredze et al. at Johns Hopkins (available from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/). I used a binary file, tokens.bin, containing the XML tree representation of the reviews. I performed all my experiments for this assignment in MATLAB.

2.1 Parsing reviews from tokens.bin

The XML tree for each review contained all the relevant information about it. The <rating> field contained a numerical score (1, 2, 4, or 5); the reviews with a rating of 3 were removed, as they were deemed to be neutral (and hence uninformative!). The <review> token was used to identify where the file information changed from one review to the next. The <reviewer>, <title>, <review_text>, <date> and <helpful> fields all contained information possibly relevant to the sentiment model. However, in this assignment I have only used the <review_text> field. This choice was based on what I felt most closely resembles the real world: we do not generally have access to any information other than the raw text.

I started off by reading from the binary stream entry by entry. This was painfully slow: it would have required more than a day to go through all the reviews' XML trees. Instead, when I read the binary input stream in blocks of 100,000 entries, the process finished in less than half an hour! That was an important lesson learnt. The time taken was not very sensitive to the block size that I used.

2.2 Duplicate review removal

Another artifact of this dataset was that it contained a number of duplicate reviews. I used a hash of the first 500 words of each review (or of the entire review if it was shorter) to eliminate duplicates. MATLAB does not have an inbuilt HashSet, so I implemented one myself. My hash function was not very strong, so there were a few false duplicate detections. The number of unique reviews I am using is 515,516.

For my experiments, I divided the dataset into 10 partitions and used a 9:1 train:test ratio. All review numbers (in the unique-reviews list) ending in the same digit are considered to be one partition.
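
To illustrate the de-duplication and partitioning steps, here is a hedged sketch. It is not the report's implementation: the author used a custom, lossy hash set, whereas this version keys a containers.Map on the exact first 500 token ids (so it produces no false duplicates), and reviewTokens is an assumed cell array holding one token-id vector per review.

    % Sketch of duplicate removal and train/test partitioning (illustrative).
    % reviewTokens is assumed: a cell array with one vector of token ids per review.
    seen = containers.Map('KeyType', 'char', 'ValueType', 'logical');
    keep = false(numel(reviewTokens), 1);
    for i = 1:numel(reviewTokens)
        t = reviewTokens{i};
        key = sprintf('%d,', t(1:min(500, numel(t))));   % first 500 tokens as the key
        if ~isKey(seen, key)
            seen(key) = true;
            keep(i) = true;                              % first occurrence: keep it
        end
    end
    uniqueIdx = find(keep);
    % 10 partitions by the last digit of the position in the unique-review list;
    % one partition is held out as the test set (9:1 train:test split).
    partition = mod((1:numel(uniqueIdx))', 10);
    testMask  = (partition == 0);
    trainMask = ~testMask;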

3 Experiments with Unigrams

In this section, I describe the different classifiers I implemented with unigram tokens.

3.1 Ridge regression with L2 loss

The first classifier I implemented was the one with an L2 loss function and ridge regularization. For this classifier we can obtain the solution in closed form (as described in Section 1). Let us analyze each step in this algorithm:

Computing X^T X: X is a sparse feature matrix, so this matrix multiplication is quite fast. For dim(X) = 15000 × 460000, this step took between 30 and 40 seconds.

Computing X^T y: much faster than the previous step.

Inverting X^T X: this p × p matrix is dense, so inverting it takes a lot of time (a few minutes in my case). This step is the main computational bottleneck. The largest matrix I could invert was for p = 20000, before running out of memory with 12GB of RAM.

The parameters of this classifier are the number of features used (p) and the regularization parameter λ. I used the p most frequent tokens as my features, for p = 5000, 15000 and 20000. As expected, the test RMSE decreased as the number of features increased. I also varied the regularization parameter λ over the values 1, 10, 100 and 1000. Increasing the regularization parameter seemed to have a marginal positive impact on performance, which suggests that the dataset favors heavier regularization (the limiting case λ = ∞ being an intercept-only model). The results are shown in Table 1.

#Features   λ = 1    λ = 10   λ = 100   λ = 1000
5000        0.9669   0.9668   0.9663    0.9657
15000       0.9561   0.9553   0.9513    0.9515
20000       0.9590   0.9573   0.9496    0.9498

Table 1: Root Mean Squared Error (RMSE) on test data for exact ridge regression (L2 loss) with unigrams.

The most influential positive and negative words from this exact method are listed in Table 2 under the "Exact" columns (these results are for p = 15000, λ = 100).

              Positive                    Negative
Exact          St. Gr.       Exact           St. Gr.
refreshingly   best          disappointing   i
funniest       life          unreadable      book
patron         most          waste           but
pleased        history       drivel          was
invaluable     love          poorly          not
excellent      human         useless         just
bravo          knowledge     tripe           like
awesome        years         laughable       dont
punches        you           worst           no
donovan        us            worthless       author

Table 2: Highest weighted positive and negative unigrams.
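
Tables like Table 2 can be read off a fitted weight vector directly. A minimal sketch, assuming a weight vector betaHat (without the intercept entry) and a matching cell array vocab of token strings:

    % Sketch: list the highest-weighted positive and negative unigrams.
    % betaHat: p-by-1 weight vector, vocab: p-by-1 cell array of token strings.
    [~, order] = sort(betaHat, 'descend');
    topPositive = vocab(order(1:10));         % most positive words
    topNegative = vocab(order(end-9:end));    % most negative words
    disp(table(topPositive, flipud(topNegative), ...
         'VariableNames', {'Positive', 'Negative'}));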

3.2 Stochastic gradient descent

The next thing I implemented was the stochastic gradient method. My first objective was to include more features in the model and see how that affected the error. All experiments in this section were run with p = 500000 features (the p most frequently occurring tokens). Initially, I stuck to the L2 loss and ridge regularization.

3.2.1 Armijo rule

The initial runs of the stochastic gradient would either just hover around the initial RMSE (without reducing it) or simply explode! Starting with a large step size and gradually reducing it with the number of iterations also did not help (I tried quite a few step-size schedules). So I finally settled on the adaptive Armijo rule to decide the step size. The Armijo rule is essentially a line search over m ∈ {1, 2, ...} for a step size α^m such that f(x + α^m Δx) < c f(x), where Δx is the gradient-descent update direction, c < 1 is the improvement you wish to achieve in each iteration, and α is the base step size. I picked the best value of m ∈ {1, ..., 10} at every step. I also experimented with different values of α: for α larger than 0.5 the process would diverge, while for α less than 0.3 it would almost always converge.

3.2.2 Training block samples

The next issue was choosing a block size. In each iteration of stochastic gradient descent, we update β̂ based on a block of reviews. If the dataset were uniform, how we choose a block would be immaterial; in this dataset, however, the distribution was not uniform. I tried two different block selection schemes. First, a sequential scanning scheme, where each update was based on a sequential chunk of 1000 reviews from the training set. The alternative was random sampling: selecting a random set of reviews from the entire training set in every iteration.

Figure 1: RMSE convergence with iterations for the different training schemes.

The convergence rates for the two schemes are shown in Figure 1. Random sampling clearly works better, which is also reflected in a better AUC score and higher lift scores (see Table 3). Unfortunately, the stochastic gradient descent method could not learn very effective classifiers, at least not as effective as the exact model with a much smaller number of features. This is reflected in the better AUC and lift scores of the exact method. A look at the most influential words also reveals that the gradient descent method focuses on very non-intuitive or vague words (Table 2).

3.2.3 Lift Scores

The lift score is essentially a measure of how much better a classifier is than a random classifier. We report the 1% lift scores. It is interesting to note that the lift scores for the positive class (i.e., positive reviews) are consistently better than those for the negative class (see Table 3). Our dataset contains many more positive reviews than negative ones. First, this results in a more confident positive classifier. Second, the test set also has more positive reviews, which in turn results in a relatively high false positive rate for the negative classifier at any given true positive rate.

Figure 2: Receiver operating characteristic (ROC) curves for the different classifiers. The exact model with a smaller number of features dominates.

3.2.4 Better initialization

The performance of the classifiers learnt by stochastic gradient descent was not as good as that of the exact model learnt previously with fewer features. The convergence patterns also suggested that a better initialization of β̂ might help. An obvious initialization point was the β̂ learnt by the exact method. Although this covers only a small portion of the new, much larger weight vector, it could prove useful. The experiments supported this intuition: the convergence rates were much better, as were the AUC and lift scores (see Table 3, Figure 1, Figure 2).
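
To make the pieces of Sections 3.2.1 and 3.2.2 concrete, here is a hedged sketch of block stochastic gradient descent with random block sampling and a geometric (Armijo-style) step search. It is illustrative only: the block size, step-size constants and acceptance test follow the description above, but Xtrain, ytrain, the iteration count and the scaling of the regularizer are assumptions, not the report's code.

    % Sketch: block SGD with ridge regularization and an Armijo-style step search.
    % Xtrain: n-by-p sparse feature matrix, ytrain: n-by-1 ratings.
    lambda    = 100;       % regularization strength
    alphaBase = 0.3;       % base step size (values above ~0.5 diverged in the report)
    c         = 0.99;      % required per-iteration improvement factor (c < 1)
    blockSize = 1000;
    nIters    = 2000;
    n = size(Xtrain, 1);
    beta = zeros(size(Xtrain, 2), 1);
    obj = @(b, X, y) 0.5 * mean((y - X * b).^2) + 0.5 * lambda * (b' * b) / n;
    for it = 1:nIters
        idx = randi(n, blockSize, 1);                    % random block sampling
        Xb = Xtrain(idx, :);  yb = ytrain(idx);
        g = Xb' * (Xb * beta - yb) / blockSize + lambda * beta / n;   % ridge gradient
        f0 = obj(beta, Xb, yb);
        for m = 1:10                                     % try steps alphaBase^1 ... alphaBase^10
            betaTry = beta - alphaBase^m * g;
            if obj(betaTry, Xb, yb) < c * f0             % Armijo-style acceptance test
                beta = betaTry;
                break;
            end
        end
    end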

3.3 Lasso regularization

The final thing I tried with unigrams was to change the regularization term from L2 (ridge) to L1 (lasso). This was convenient from an implementation standpoint, since it amounted to a single-line change in the stochastic gradient descent code. The convergence of the ridge and lasso regularization methods was almost identical, as seen in Figure 3.

Figure 3: Gradient descent convergence for ridge and lasso regularization.
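
In terms of the block-SGD sketch given above (after Section 3.2.4), that single-line change amounts to swapping the regularization term in the gradient; the line below is illustrative and mirrors that sketch rather than the report's code:

    % Ridge gradient term (as in the sketch above):
    %   g = Xb' * (Xb * beta - yb) / blockSize + lambda * beta / n;
    % Lasso version: replace the L2 penalty gradient with the L1 subgradient.
    g = Xb' * (Xb * beta - yb) / blockSize + lambda * sign(beta) / n;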


Method               AUC      Class   1% lift score
Sequential Scan      0.8745   -ve     11.22
                              +ve     26.74
Random Scan          0.9047   -ve     18.90
                              +ve     27.54
Ridge - Better init  0.9111   -ve     19.87
                              +ve     32.27
Lasso - Better init  0.9103   -ve     20.92
                              +ve     29.65
Matrix inversion     0.9290   -ve     30.11
                              +ve     36.87

Table 3: AUC and lift scores of various models with unigrams.

4 Experiments with bigrams

The final thing I tried was to run the exact method with bigram tokens instead of unigrams. Since bigrams capture more context than their constituent unigrams do individually, this is a natural extension.

4.1 Constructing bigram tokens

As mentioned before, MATLAB does not have a HashSet implementation. In order to assign token numbers to the bigrams, I had to create a hash function for any possible bigram. A unique hash is ensured by the following mapping:

<token_1, token_2> → B · token_1 + token_2

where B is the vocabulary size. However, if we place this hash value into K bins with a modulo-K operation, the results will be disastrous, since all bigrams ending with the same word end up in the same bin, resulting in an enormous number of collisions. Instead, we can choose the bin number by the following mapping:

#bin(<token_1, token_2>) = mod(token_1 · token_2, K)

This ensures a much more even distribution of the bigrams in terms of their number of occurrences (a small sketch of this scheme is given at the end of Section 4). To reduce the number of possible bigrams, I only considered bigrams where both tokens were among the 100,000 most frequent unigram tokens. Once the hash is set up, bigrams are given token numbers in decreasing order of frequency (similar to the unigram numbering scheme).

4.2 Results

I used the exact method with the 15,000 most frequent bigrams as features. Varying λ did not affect the results much (all reported results are for λ = 100). The AUC score was 0.9017. This was initially a surprise, since it is smaller than the AUC of the unigram model. However, an inspection of the most influential positive and negative bigrams (see Table 4) indicates what went wrong: the top-scoring unigrams (which were also very intuitive) did not make it into the list of most frequent bigrams, and hence were not available as features.

Positive            Negative
enlightening and    however in
the funniest        job with
together they       from page
a moral             works on
not afraid          i became

Table 4: Highest weighted positive and negative bigrams.

4.3 Measuring flops

MATLAB deprecated the flops function when it adopted LAPACK a few years ago. I looked online for other ways of measuring flop counts and came across a library by Tom Minka (http://research.microsoft.com/en-us/um/people/minka/software/lightspeed/). The general-purpose flops method does not quite work as advertised. The operation-specific methods (I checked flops_inv) do seem to work, but it did not make sense to quote flops for single operations, and I could not figure out a meaningful way of combining the individual methods into a composite flops value (I would be curious to know if someone else did).
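
Returning to the bigram construction of Section 4.1, the two mappings are simple to express. The sketch below is illustrative: the bin count K and the vectors tok1, tok2 of unigram token ids are assumptions, not values from the report.

    % Sketch of the bigram keying and binning scheme from Section 4.1 (illustrative).
    % tok1, tok2: vectors of unigram token ids (both restricted to the
    % 100,000 most frequent unigrams) for each candidate bigram.
    B = 100000;                      % vocabulary size used for bigram keys
    K = 2^22;                        % number of hash bins (assumed value)
    uniqueKey = B .* tok1 + tok2;    % unique key: B*token1 + token2
    % The report notes that binning uniqueKey mod K clusters bigrams by their
    % second token; binning on the product spreads bigrams far more evenly.
    bin = mod(tok1 .* tok2, K);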

5 Conclusion

Unfortunately, I underestimated the time it would take me to create the bigram tokens, and hence ran out of time to run a stochastic gradient optimization with a larger number of bigram tokens (which would have included the more expected bigrams). There is a small chance that there is an indexing issue somewhere in my bigram creation pipeline, but it has passed all the sanity checks I put it through.

The biggest takeaway from this assignment for me was learning to deal with moderately large amounts of data with a reasonable amount of computational power (a single machine). The implementation optimizations, including hash tables, sparse vectors and matrices, and block updates of these sparse structures, are important lessons for the future.

Wishlist

I did not have access to the matfile command in my version of MATLAB. That command could possibly have eased the initial data-handling phase of this assignment.

Acknowledgements

The author would like to thank Aastha Jain for generously lending her powerful workstation. The author also acknowledges Mobin Javed and Anupam Prakash for several interesting discussions.