CS294-1 Assignment 2 Report

Keling Chen and Huasha Zhao

February 24, 2012

1 Introduction

The goal of this homework is to predict a user's numeric rating for a book from the text of the user's review. The original dataset contains 975194 Amazon book reviews. For each experiment, we perform 10-fold cross validation to estimate prediction error.

2 Problem

Assume there are altogether $K$ distinct users and $I$ unique products reviewed in our dataset, labeled $u_1, u_2, \ldots, u_K$ and $p_1, p_2, \ldots, p_I$ respectively. For an individual user $u_k$, denote the number of reviews he or she makes as $n_k$. Further assume that there are in total $N$ reviews across all users and products in the corpus, where $N = \sum_{k=1}^{K} n_k$. The review documents are labeled $D_1, D_2, \ldots, D_N$, and we arrange them so that reviews from the same user stay together, in increasing order of user label. The vocabulary set (dictionary) of all the documents is denoted $\mathcal{V}$, and its size is $V = |\mathcal{V}|$.

Each document is associated with a review score and a bag-of-words feature vector. The feature vector $f_n = (f_{n,1}, f_{n,2}, \ldots, f_{n,V})$ records the number of appearances of each word $v \in \mathcal{V}$ in document $D_n$; we expect it to be a sparse vector with non-negative integer entries. Further let $s = (s_1, s_2, \ldots, s_N)^T$ denote the score vector, so that the $n$th entry $s_n$ is the review score of document $D_n$. Similarly, the feature vectors of the documents are stacked to form the document matrix $X = (f_1; f_2; \ldots; f_N)$. Since each feature vector has the same dimension as the vocabulary set $\mathcal{V}$, $X$ is an $N \times V$ matrix.

Our aim is to predict review scores from the document feature vectors. We propose a linear prediction model that assigns a weight to each word, with the final scores adjusted by user biases.
Concretely, the predicted scores $\hat{s}$ are calculated by

$$\hat{s} = Xw \qquad (1)$$

where $w$ is a $V$-dimensional weight vector, each entry of which denotes how strongly the corresponding word determines the score.
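As an illustration of the notation above, the following minimal sketch builds a small count matrix $X$ and evaluates equation (1). The whitespace tokenizer, the function names `build_count_matrix` and `predict_scores`, the toy documents, and the weight values are all assumptions made for this example and are not taken from the original experiments. A dense array is used only for readability; a corpus of this size would presumably need a sparse representation.

```python
from collections import Counter

import numpy as np


def build_count_matrix(documents, vocabulary):
    """Return the N x V matrix X whose row n holds the word counts of document n."""
    index = {word: j for j, word in enumerate(vocabulary)}
    X = np.zeros((len(documents), len(vocabulary)))
    for n, text in enumerate(documents):
        # Assumed tokenizer: lowercase, then split on whitespace.
        for word, count in Counter(text.lower().split()).items():
            if word in index:  # words outside the vocabulary are ignored
                X[n, index[word]] = count
    return X


def predict_scores(X, w):
    """Equation (1): the predicted scores are the matrix-vector product Xw."""
    return X @ w


# Toy data, made up for illustration only.
documents = ["great great book", "poor plot", "great plot"]
vocabulary = ["great", "poor", "plot"]
X = build_count_matrix(documents, vocabulary)
w = np.array([1.0, -2.0, 0.5])  # hypothetical word weights
s_hat = predict_scores(X, w)
```

Each row of `X` plays the role of one feature vector $f_n$, and `s_hat` has one entry per document.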
The problem now is to estimate the parameter $w$ given the dataset. Since most words would not contribute much to the final score, we expect $w$ to have a sparse structure. It is natural to consider the optimization problem of minimizing the $\ell_2$ norm of the error, that is,

$$\min_{w,c} \|\hat{s} - s\|_2 \qquad (2)$$

where $c$ is the vector of reviewer biases introduced in equation (3).

3 Methods

3.1 Part 1

We try both Ridge and Lasso regularization to solve the above optimization problem. The exact solution is computed for Ridge, and stochastic gradient descent is used to approximate Lasso, each with several values of the penalty factor $\lambda$. These are the baseline algorithms in this paper. To further boost the prediction accuracy, we also add the following two features.

3.1.1 Reviewer Preference

Some reviewers tend to give higher scores than others given the same attitude towards the product. One more feature is added to each review to characterize this reviewer bias. Specifically, equation (1) is modified to

$$\hat{s} = Xw + c \qquad (3)$$

where the vector $c$ has the same size as $\hat{s}$ and represents the customer biases. Given the order in which we arrange the review documents, we have $c_1 = c_2 = \cdots = c_{n_1}$, $c_{n_1+1} = c_{n_1+2} = \cdots = c_{n_1+n_2}$, and so on, because scores rated by the same user should have the same bias.

3.1.2 Rating Drift

Rating scores are discrete in nature, and it is hard to say that a 4-star rating places the reviewer's attitude exactly midway between 3 stars and 5 stars. Given that 1 star and 5 stars represent the two extremes of a reviewer's attitude towards a product, we introduce adjustments to reviews with 2 or 4 stars. The star drift is another reviewer-specific parameter.

3.2 Part 2

Since duplicated reviews might affect the prediction accuracy, we hash the first 15 words of each review to label unique reviews and remove the duplicates. This leaves around 500,000 unique reviews. Low-frequency feature words and some stop words are removed to reduce the background noise.
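The deduplication step described above can be sketched as follows. This is only an illustration under stated assumptions: the report does not say which hash function or tokenization was used, so SHA-256 over lowercased, whitespace-split tokens is an assumption, and the names `first_words_key` and `remove_duplicates` are hypothetical.

```python
import hashlib


def first_words_key(review_text, num_words=15):
    """Hash the first `num_words` whitespace-separated tokens of a review."""
    # Assumed normalization: lowercase and whitespace tokenization.
    prefix = " ".join(review_text.lower().split()[:num_words])
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def remove_duplicates(reviews, num_words=15):
    """Keep the first review seen for each prefix hash, preserving input order."""
    seen = set()
    unique = []
    for text in reviews:
        key = first_words_key(text, num_words)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Toy reviews, made up for illustration only.
reviews = [
    "A wonderful read from start to finish",
    "A wonderful read from start to finish",
    "Poorly edited and hard to follow",
]
unique_reviews = remove_duplicates(reviews)
```

Two reviews that share their first 15 words are treated as the same review, even if they differ later in the text.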
Finally we obtain a predictor matrix $X$ with 502460 samples and 10999 features, and a response vector $Y$ with 502460 rows. The dataset is randomized before 10-fold cross validation in order to reduce the bias that may be caused by the original ordering of the data.

We minimize the $\ell_2$ loss and, in order to avoid overfitting, add an $\ell_2$-norm regularizer with parameter $\lambda > 0$, as shown in (4):

$$\beta^{\text{ridge}} = \arg\min_{\beta} \; (Y - X\beta - \beta_0)^T (Y - X\beta - \beta_0) + \lambda \|\beta\|^2 \qquad (4)$$

We solve the above optimization problem with the algorithm below, using stochastic gradient descent. The procedure starts with $\beta = 0$, $\beta \in \mathbb{R}^p$. It then iteratively updates every coordinate of the vector until convergence. At each iteration $t$, randomly choose a block of training data $X_b, Y_b$:

1) $G_t = \partial f / \partial \beta = -X_b^T (Y_b - X_b \beta_t) + \lambda \beta_t$ (the constant factor of 2 is absorbed into the step size);
2) update $\beta_{t+1} = \beta_t - \alpha_t G_t$, where we optimize the step size $\alpha_t$ at each iteration $t$ by solving $\min_{\alpha_t} \; (Y_b - X_b(\beta_t - \alpha_t G_t))^T (Y_b - X_b(\beta_t - \alpha_t G_t)) + \lambda \|\beta_t - \alpha_t G_t\|^2$;
3) use the remaining training data to compute the Root MSE as the criterion.

4 Results

4.1 Part 1

4.1.1 Model Comparison

Four models are compared in this section: the baseline Ridge model (unigram), the baseline Lasso model, baseline + reviewer preference, and baseline + preference + star drift. We use 10-fold cross validation and Root MSE as performance measures. Each model is tested with $\lambda \in \{0.1, 0.2, 0.5, 1.0, 2.0\}$. The best-performing $\lambda$ is chosen for each model, and the corresponding RMSE values for the four models are plotted in Figure 1. We can see that the $\ell_2$ penalty outperforms $\ell_1$, and that the last model beats the others significantly.
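The model selection procedure of Section 4.1.1 (10-fold cross validation over a grid of $\lambda$ values, scored by RMSE) can be sketched for the Ridge baseline as below. This is a minimal sketch under stated assumptions: the data is synthetic and stands in for the review matrix, the intercept and reviewer-specific terms are omitted, and the function names `ridge_fit`, `rmse`, and `cross_validated_rmse` are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def cross_validated_rmse(X, y, lam, n_folds=10, seed=0):
    """Average held-out RMSE over n_folds folds of a shuffled dataset."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(rmse(y[test_idx], X[test_idx] @ w))
    return float(np.mean(errors))


# Synthetic stand-in data (not the Amazon review dataset).
rng = np.random.default_rng(1)
X = rng.poisson(0.3, size=(200, 20)).astype(float)
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=200)

lambdas = [0.1, 0.2, 0.5, 1.0, 2.0]
scores = {lam: cross_validated_rmse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
```

The same loop could be reused for the other models by swapping `ridge_fit` for a different estimator, since only the fitting step depends on the model.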
Figure 1: Model Comparisons: X-axis is 10-fold cross validation, Y-axis is RMSE. (Plot title: Comparison of Best Cross Validation Performance (RMSE). Series: baseline l1; baseline l2; include individual rating preference; include individual rating preference and star drift.)

4.1.2 ROC and Lift Score

We further compare the performance of the full model and the model without star drift adjustment using ROC and Lift score measures. Even though the full model outperforms the others in the RMSE measure, it does not show a significant improvement in classifying sentiment polarity. This is illustrated in Figures 2 and 3.

4.1.3 Strongest Terms

We use unigrams throughout the experiment. The strongest positive and negative terms are listed in Table 1.
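One plausible way to extract the strongest terms from a fitted weight vector $w$ is to sort the weights and read off both ends. The sketch below assumes exactly that; the report does not describe how the terms were selected, and the function name `strongest_terms` and the weight values are made up for illustration.

```python
import numpy as np


def strongest_terms(w, vocabulary, top_k=10):
    """Return the top_k most positive and top_k most negative terms by weight."""
    order = np.argsort(w)  # indices sorted by ascending weight
    negative = [vocabulary[j] for j in order[:top_k]]
    positive = [vocabulary[j] for j in order[::-1][:top_k]]
    return positive, negative


# Hypothetical weights for a five-word vocabulary.
vocabulary = ["excellent", "waste", "plot", "loved", "useless"]
w = np.array([1.2, -1.5, 0.05, 0.9, -1.1])
positive, negative = strongest_terms(w, vocabulary, top_k=2)
```

With these made-up weights, `positive` is `["excellent", "loved"]` and `negative` is `["waste", "useless"]`.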
Figure 2: ROC plot. (Series: Full Model; Without Star Drift. X-axis: 1 − specificity. Y-axis: sensitivities.)
Figure 3: Lift plot. (Series: Full Model; Without Star Drift. X-axis: 1 − specificity. Y-axis: lift value.)
Table 1: Terms with Strongest Weights

Positive: Fascinating, Excellent, unsettling, Highly, adorable, Laden, punches, Brilliant, Wonderful, Loved

Negative: poorly, Save, waste, disappointment, useless, disjointed, disappointing, unreadable, Sorry, drivel

4.2 Part 2

The data is split into 10 roughly equal-sized folds ( documents each), so that the estimated prediction error (RMSE) is the average over the trials of 10-fold cross validation. We use word unigrams as the features. Figure 4 shows an example of one training procedure: the RMSE decreases from 1.5176 to 0.9986 after 800 iterations.

Figure 4: Stochastic gradient descent training results

The RMSE on the testing data for 10-fold cross validation is shown in Figure 5. The average RMSE is 1.0081.
Figure 5: RMSE as prediction error for testing data by stochastic gradient descent

5 Discussion

Based on our results, we conclude that the model including individual rating preference and star drift is better suited to capturing features and minimizing prediction error. Although we focus on unigrams to model reviews during stochastic gradient descent, we believe the same framework can benefit from modeling n-grams. We also tried different values of $\lambda$ and different block sizes of training samples. It turns out that if the block size is too small, the descent results are not very good, probably because the randomness between small samples leads to large fluctuations in the gradient descent.
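To make the block-size discussion concrete, the block stochastic gradient procedure of Section 3.2 can be sketched as below. This is a sketch under stated assumptions rather than the authors' implementation: the intercept $\beta_0$ is omitted, the data is synthetic, and the function name `block_sgd_ridge` is hypothetical. Because the block objective is quadratic in the step size, the line search has the closed form $\alpha_t = G_t^T G_t / (\|X_b G_t\|^2 + \lambda \|G_t\|^2)$, which is what the code uses.

```python
import numpy as np


def block_sgd_ridge(X, y, lam, block_size=50, n_iters=200, seed=0):
    """Block SGD for ridge regression with an exact line search per block."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])  # the procedure starts from beta = 0
    for _ in range(n_iters):
        # Randomly choose a block of training data (X_b, Y_b).
        idx = rng.choice(len(y), size=block_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Step 1: gradient direction (factor 2 absorbed into the step size).
        G = -Xb.T @ (yb - Xb @ beta) + lam * beta
        # Step 2: exact minimizer of the block objective along -G.
        XG = Xb @ G
        denom = XG @ XG + lam * (G @ G)
        if denom == 0.0:
            break  # zero gradient on this block, nothing to update
        alpha = (G @ G) / denom
        beta = beta - alpha * G
    return beta


def rmse(y_true, y_pred):
    """Root mean squared error, the criterion of step 3."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic stand-in data (not the Amazon review dataset).
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=500)

rmse_before = rmse(y, X @ np.zeros(10))
beta = block_sgd_ridge(X, y, lam=0.1)
rmse_after = rmse(y, X @ beta)
```

Rerunning `block_sgd_ridge` with a much smaller `block_size` is one way to observe how noisier block gradients affect the final RMSE.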