Amazon Review Rating Prediction with Text-Mining, Latent-Factor Model and Restricted Boltzmann Machine

Cheng Guo (A53201515, guochengc422@gmail.com)
Zhichen Wu (A53214514, zhw278@eng.ucsd.edu)
Juncheng Liu (A53223244, jul310@eng.ucsd.edu)
Linghao Zhu (A53203446, liz177@eng.ucsd.com)

Abstract

To make recommendations, electronic commerce companies must first predict how a user will respond to a new product. To do so, they need to learn the preferences of each user as well as the features of each product. Predicting ratings from review information is therefore a crucial task. In this paper, we adopt three methods for rating prediction: a text-mining approach based on the review text, a latent-factor model, and an RBM (Restricted Boltzmann Machine). In our experiments, we compare these three models on Amazon Review Datasets from different product categories and find that their performance varies with the characteristics of the dataset. For datasets with dense user-item pairs (all users and items have at least several reviews), the latent-factor model performs quite well. For datasets with enough review text, the text-mining method shows strong predictive ability. RBM is an approach with great potential that is worth further exploration and research.

1 Introduction

The goal of our project is to predict ratings from review information. Online reviews play a crucial role when users decide between products. They are used extensively for movies, online shopping sites, restaurants, etc. Most platforms allow users to submit a text review as well as a numeric rating. We implement several methods to predict ratings on the Amazon Review Dataset, including text mining, a latent-factor model, and an RBM (Restricted Boltzmann Machine).
These models are relatively simple, but they often perform well in practice. Because their performance varies across datasets with different characteristics, we also identify which model best suits each kind of dataset. Specifically, since the latent-factor model is known to perform well on datasets with dense user-item pairs, we compress the dataset step by step and examine the performance of each model at each step.

2 Dataset Description

We use the Amazon Review Dataset crawled in [2], which spans May 1996 to July 2014 and contains approximately 35 million reviews in total. The dataset is divided into 26 parts based on the top-level category of each product (e.g. books, movies).

2.1 Basic Statistics and Property

We use the preprocessed dense 5-core version of the dataset, in which every remaining user and item has at least 5 reviews. To compare the models across product categories, we choose three categories of similar size: Video Games, Health and Personal Care, and Beauty. Table 1 summarizes them.

Category                  #Reviews   #Users   #Items   #Vocabulary   #Words   Avg #Words
Video Games                 231780    24303    10672        507742    47.6M          205
Health & Personal Care      346355    38609    18534        314105    32.7M           94
Beauty                      198502    22363    12101        162539    17.6M           88

Table 1: Dataset statistics (number of reviews; number of users; number of items; vocabulary size; total number of words; average number of words per review)

The vocabulary and word counts in this dataset are rich, so a text-mining method can extract significant information for the rating prediction task. On average, each user has written roughly 9 to 10 reviews and each item has received roughly 16 to 22 reviews, so the user-item pairs are dense enough for a latent-factor model to perform well.

Each review has the following format:

reviewerid - ID of the reviewer, e.g. A2SUAM1J3GNN3B
asin - ID of the product, e.g. 0000013714
reviewername - name of the reviewer
helpful - helpfulness rating of the review, e.g. 3 of 5
reviewtext - text of the review
overall - rating of the product
summary - summary of the review
unixreviewtime - time of the review (unix time)
reviewtime - time of the review (raw)

2.2 Exploratory Analysis

We first examine the rating distribution of each category, shown in Figure 1.
Figure 1: Rating Distribution. (a) Health & Personal Care (b) Video Games (c) Beauty

The distributions show that ratings in all three categories are skewed high: 5-star reviews account for almost half of all reviews. Recognizing the textual features of negative reviews should therefore help the text-mining model improve its rating predictions.
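As a minimal sketch of how such a rating distribution could be computed, the snippet below counts the `overall` field of each review record. The function name `rating_distribution` and the sample records are hypothetical and only illustrate the computation; they are not the code used for Figure 1.

```python
from collections import Counter

def rating_distribution(reviews):
    """Return {star: fraction of reviews} for star ratings 1 through 5."""
    counts = Counter(int(review["overall"]) for review in reviews)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reviews given")
    return {star: counts.get(star, 0) / total for star in range(1, 6)}

# Hypothetical sample records; real records carry the other fields as well.
sample = [{"overall": 5.0}, {"overall": 5.0}, {"overall": 4.0}, {"overall": 1.0}]
dist = rating_distribution(sample)
for star in range(1, 6):
    print(star, dist[star])
```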

We also examine the density of user-item pairs in the dataset. Specifically, we plot the distribution of the number of items reviewed per user and the number of users reviewing each item.

Figure 2: Item Distribution for Users. (a) Health & Personal Care (b) Video Games (c) Beauty

Figure 3: User Distribution for Items. (a) Health & Personal Care (b) Video Games (c) Beauty

Because we use the preprocessed 5-core dataset, every item is reviewed by at least 5 users and every user has reviewed at least 5 items, which makes it far denser than the raw review dataset. We therefore expect the latent-factor model to suit this kind of dense dataset. We also hypothesize that compressing the data more aggressively (increasing the k-core index) improves model performance, and we test this hypothesis in the experiments below.

3 Predictive Task Identification

Our main task is to predict the rating from the given review information with different models on different datasets. With the text-mining method and the latent-factor model, this is framed as a regression problem in which ratings are treated as continuous values from 1 to 5. With the RBM model, it becomes a classification problem in which the integer ratings 1 to 5 are viewed as 5 different classes.

We are also interested in comparing the performance of different models on different datasets. Specifically, we compress each dataset by increasing the k-core index so that only users and items with a large number of reviews are kept, which makes the dataset denser. We then examine how the performance of each model changes as the dataset is compressed.

3.1 Evaluation of Model

We mainly adopt MSE (Mean Squared Error) as the metric for evaluating our models. We also consider the effect of data size on prediction performance.
Furthermore, for the text-mining model, we extract the most representative words, i.e. the vocabulary entries with the highest and lowest weights, for the positive and negative reviews of each product category and check whether these words make sense. For each category, we randomly select 80% of the data as the training set and use the remaining 20% as the test set.
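The evaluation protocol above (a random 80/20 split scored by MSE) can be sketched as follows. The helper names `train_test_split` and `mse`, the fixed seed, and the toy ratings are assumptions made for illustration; the example scores a constant prediction equal to the mean training rating.

```python
import random

def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Toy ratings (hypothetical), scored with a constant mean prediction.
ratings = [5, 4, 5, 1, 3, 5, 2, 4, 5, 4]
train, test = train_test_split(ratings)
mean_rating = sum(train) / len(train)
test_mse = mse([mean_rating] * len(test), test)
print(len(train), len(test), test_mse)
```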

3.2 Relevant Baseline

Average rating: the simplest baseline predicts the average of all training ratings in the dataset. In terms of MSE, this is the best possible constant predictor, so we use it as the baseline system.

3.3 Data Preprocess

For the text-mining model, the features are extracted from the review text. Specifically, we adopt a bag-of-words model with TF-IDF weighting, which is explained in Section 4.3. To implement the TF-IDF feature extraction, we use the TfidfVectorizer module in sklearn to remove punctuation and stopwords from the raw review text and then compute the TF-IDF weights of each review.

For the latent-factor and RBM models, the only information needed is the (rating, user, item) triple, which is easily extracted from the raw dataset.

For the experiments on compressed datasets, we increase the k-core index by iteratively removing reviews whose user or item has fewer than k reviews, until the dataset no longer changes. The original 5-core data already corresponds to k = 5. We further compress the data by setting k = 7, 9, 11, 13, which gives 5 different datasets for each category. Table 2 summarizes the preprocessed datasets.
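The iterative k-core filtering described above can be sketched as follows. This is a minimal illustration under the assumption that reviews are given as (user, item, rating) tuples; the function name `k_core` and the toy data are hypothetical.

```python
from collections import Counter

def k_core(reviews, k):
    """Repeatedly drop reviews whose user or item has fewer than k reviews.

    reviews: list of (user, item, rating) tuples.
    Stops when a full pass removes nothing.
    """
    current = list(reviews)
    while True:
        user_counts = Counter(user for user, _, _ in current)
        item_counts = Counter(item for _, item, _ in current)
        kept = [
            (user, item, rating)
            for user, item, rating in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(current):
            return kept
        current = kept

# Toy data (hypothetical): removing one review can push others below k.
reviews = [
    ("u1", "a", 5), ("u1", "b", 4), ("u2", "a", 3),
    ("u2", "b", 5), ("u3", "a", 1), ("u1", "c", 2),
]
print(len(k_core(reviews, 2)))
print(len(k_core(reviews, 3)))
```

A single pass is not enough in general, because removing a sparse user lowers the review counts of the items that user reviewed; hence the loop until nothing changes.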
Category                  K-Core   #Reviews   #Users   #Items   #Vocabulary
Health & Personal Care         5     346355    38609    18534        314105
                               7     129642     8965     5330        192441
                               9      60902     2632     1449        133376
                              11      52160     1961     1181        120290
                              13      46651     1595     1070        728209
Video Games                    5     231780    24303    10672        507742
                               7      13060     9808     5641        375023
                               9      71184     4212     2928        263050
                              11      35891     1810     1466        171557
                              13       6850      330      307         59867
Beauty                         5     198502    22363    12101        162539
                               7      60276     4322     2423         85674
                               9      30818     1531      768         60259
                              11      26983     1197      693         55156
                              13      23352      949      624         49874

Table 2: K-core dataset statistics (number of reviews; number of users; number of items; vocabulary size)

The summary shows that the number of reviews, users, items and vocabulary entries all drop sharply as the dataset is compressed, while the density of the user-item pairs increases.

4 Model Design and Description

In this section, we describe in detail the three methods we adopt for the rating prediction task and the motivation behind their design.

4.1 Latent Factor Model

We first ignore the review text and predict the rating from the user ID and item ID alone. In this scenario, the latent-factor model is a natural choice. We predict the rating as

$$\hat{r}_{u,i} = \alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i \tag{1}$$

We measure the model with the squared error. To prevent overfitting, we add L2 regularization to control the model complexity. Since $\alpha$ is a base estimate, we do not penalize it. Since $\beta$ and $\gamma$ have different dimensions and probably different magnitudes, we penalize them with different coefficients. The loss is therefore

$$E = \sum_{(u,i) \in \text{train}} \left( \alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i - R_{u,i} \right)^2 + \lambda_\beta \Big( \sum_u \beta_u^2 + \sum_i \beta_i^2 \Big) + \lambda_\gamma \Big( \sum_u \|\gamma_u\|_2^2 + \sum_i \|\gamma_i\|_2^2 \Big) \tag{2}$$

Following this definition, we take the derivatives of the loss and update $\alpha$, $\beta$ and $\gamma$ accordingly until convergence. In addition, different categories have different rating distributions, so fitting a separate model per category is the better choice.

4.1.1 Optimization

Besides fitting separate models, we can also incorporate category information into the latent-factor model. Inspired by approaches that incorporate user information, we introduce $\rho_c$, the latent factor of category $c$, add it to $\gamma_i$, and take the inner product of the sum with $\gamma_u$. The prediction becomes

$$\hat{r}_{u,i} = \alpha + \beta_u + \beta_i + \gamma_u \cdot \Big( \gamma_i + \sum_{c=1}^{C} A_i(c)\, \rho_c \Big) \tag{3}$$

where $C$ is the total number of categories (3 in our case) and $A_i$ is a one-hot vector in which $A_i(c) = 1$ means that item $i$ belongs to category $c$. The loss becomes

$$E = \sum_{(u,i) \in \text{train}} \Big( \alpha + \beta_u + \beta_i + \gamma_u \cdot \Big( \gamma_i + \sum_{c} A_i(c)\, \rho_c \Big) - R_{u,i} \Big)^2 + \lambda_\beta \Big( \sum_u \beta_u^2 + \sum_i \beta_i^2 \Big) + \lambda_\gamma \Big( \sum_u \|\gamma_u\|_2^2 + \sum_i \|\gamma_i\|_2^2 \Big) + \lambda_\rho \sum_c \|\rho_c\|_2^2 \tag{4}$$

To minimize the loss, we take the derivative with respect to every parameter. Writing $e_{u,i} = \alpha + \beta_u + \beta_i + \gamma_u \cdot \big( \gamma_i + \sum_c A_i(c)\, \rho_c \big) - R_{u,i}$ for the residual, $I_u$ for the set of items reviewed by user $u$, and $U_i$ for the set of users who reviewed item $i$, we obtain

$$\begin{aligned}
\frac{\partial E}{\partial \alpha} &= 2 \sum_{(u,i) \in \text{train}} e_{u,i} \\
\frac{\partial E}{\partial \beta_u} &= 2 \sum_{i \in I_u} e_{u,i} + 2 \lambda_\beta \beta_u \\
\frac{\partial E}{\partial \beta_i} &= 2 \sum_{u \in U_i} e_{u,i} + 2 \lambda_\beta \beta_i
\end{aligned} \tag{5}$$

For these three parameters ($\alpha$, $\beta_u$, $\beta_i$), we set the derivative to zero and solve the resulting equation, holding the other parameters fixed. With the same residual $e_{u,i}$, the remaining derivatives are

$$\begin{aligned}
\frac{\partial E}{\partial \gamma_u} &= 2 \sum_{i \in I_u} e_{u,i} \Big( \gamma_i + \sum_c A_i(c)\, \rho_c \Big) + 2 \lambda_\gamma \gamma_u \\
\frac{\partial E}{\partial \gamma_i} &= 2 \sum_{u \in U_i} e_{u,i}\, \gamma_u + 2 \lambda_\gamma \gamma_i \\
\frac{\partial E}{\partial \rho_c} &= 2 \sum_{(u,i) \in \text{train}} e_{u,i}\, \gamma_u\, A_i(c) + 2 \lambda_\rho \rho_c
\end{aligned} \tag{6}$$

For these three parameters ($\gamma_u$, $\gamma_i$, $\rho_c$), we use gradient descent on the full batch of data. However, updating all parameters jointly from the beginning sometimes leads in a poor direction. To reach a better local minimum, we first update $\alpha$ and $\beta$ until convergence, then update $\gamma$ and $\rho$ until convergence, and finally update all parameters except $\alpha$ until convergence.

4.2 Restricted Boltzmann Machine

A Boltzmann Machine is a generative stochastic neural network that can learn a probability distribution over its set of inputs. A Restricted Boltzmann Machine restricts the connectivity to one visible layer and one hidden layer, with no edges between units in the same layer. By summing over the states of the hidden units together with the weights, we obtain the probability distribution over the visible units, and the output can then be sampled from that distribution.

However, a traditional RBM cannot solve the rating prediction problem because of its binary states and the missing rating data. To deal with this, we apply the RBM variant of Salakhutdinov et al. [4]. In that paper, the RBM is modified to use softmax visible units. Moreover, a different RBM is constructed for each user, while the weights between the hidden units and a visible unit are shared across all users who have rated the item corresponding to that visible unit. Unrated visible units are disconnected from the hidden units. Unfortunately, we were unable to fully replicate the work in that paper, so the performance is quite limited.

4.3 Text Mining Approach

Because the review text carries rich information, we adopt a text mining approach for the rating prediction task.
For the text mining approach, we extract features from the review text, specifically the TF-IDF weight of each unigram in the vocabulary. The TF-IDF weight is the product of two terms. The first is the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document. The second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.

Because the vocabulary is large, the TF-IDF feature matrix is huge and sparse, which makes dimension reduction methods such as PCA impractical. Also, because the feature vector is so sparse, other features such as helpfulness and time have a negligible effect on the overall regression performance, so we discard them for this task.

After feature extraction, we perform the regression with an SVR (Support Vector Regression) model. The model produced by support vector classification depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function ignores any training data close to the model prediction. A linear SVR minimizes

$$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \left( \chi_i + \chi_i^* \right)$$

subject to

$$\begin{aligned}
y_i - \langle w, x_i \rangle - b &\le \epsilon + \chi_i \\
\langle w, x_i \rangle + b - y_i &\le \epsilon + \chi_i^* \\
\chi_i,\ \chi_i^* &\ge 0
\end{aligned}$$

where $C$ is a penalty parameter and $\epsilon$ is the width of the insensitive tube. We perform a grid search over these hyper-parameters. Because of the scaling issue, we randomly select only 50K samples from the dataset and use 3-fold cross-validation, and we finally choose $C = 1$ and $\epsilon = 0.2$ as the best option. We tried both the linear kernel and the RBF kernel and found that the linear kernel performs better. The penalty parameter $C$ controls the trade-off between fitting the training data and keeping $w$ small, which alleviates overfitting.

The strength of the text mining method is that it takes full advantage of the text information in the review. However, it requires a large amount of text data to train a decent model that makes correct predictions.

4.4 Model Comparison

Each of the three models applied in this task has its own strengths and weaknesses.

The latent-factor model can deal with pure rating data without any assistance from other information, so it is the most general model for this task. However, its performance is highly related to the density of the rating matrix. Once the matrix is too sparse, it can predict little more than an average rating.

RBM shares almost the same strengths and weaknesses as the latent-factor model. In addition, it can take advantage of its hidden layer to explore more latent information. But RBM is hard to implement and even harder to improve, whether by tuning the parameters or by changing its network structure.

The text mining method directly exploits the information in the review text, which is a huge advantage when such information comes with the rating. Nevertheless, it can suffer from insufficient data: if only a few review texts are available, the observed distribution and usage of words cannot be close to the real-world situation.
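The text mining pipeline of Section 4.3 can be sketched with sklearn's TfidfVectorizer and LinearSVR as below. This is an illustrative sketch, not the exact experimental code: the grid values, the `stop_words="english"` setting, `max_iter`, the helper name `build_rating_regressor`, and the toy reviews are all assumptions, and the 50K subsampling step is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVR

def build_rating_regressor(c_values=(0.1, 1.0, 10.0), eps_values=(0.0, 0.2, 0.5)):
    """TF-IDF features + linear SVR, tuned by 3-fold cross-validated grid search."""
    pipeline = make_pipeline(
        TfidfVectorizer(stop_words="english"),      # assumed stopword setting
        LinearSVR(random_state=0, max_iter=10000),  # linear kernel
    )
    param_grid = {
        "linearsvr__C": list(c_values),          # penalty parameter C
        "linearsvr__epsilon": list(eps_values),  # width of the insensitive tube
    }
    return GridSearchCV(pipeline, param_grid, cv=3,
                        scoring="neg_mean_squared_error")

# Toy reviews and ratings (hypothetical).
texts = [
    "great product, works perfectly", "terrible quality, broke after a day",
    "love it, best purchase ever", "boring and disappointing",
    "good value and fast shipping", "worst purchase, total waste",
    "amazing, highly recommend", "okay but nothing special",
    "does the job well enough",
]
ratings = [5, 1, 5, 2, 4, 1, 5, 3, 4]

search = build_rating_regressor()
search.fit(texts, ratings)
predictions = search.predict(["great game, love it", "terrible, broke quickly"])
print(search.best_params_, predictions)
```

Grid search scores each (C, epsilon) pair by negated MSE, so the selected pair is the one with the lowest cross-validated MSE, matching the evaluation metric of Section 3.1.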
5 Related Work

Several previous works have addressed the Amazon review rating prediction task. The Amazon review dataset is crawled from the Amazon website and is widely used in research on text mining and latent-factor models for recommender systems. The state-of-the-art methods currently employed to study this problem are therefore text mining methods and latent-factor models.

5.1 Latent-Factor Model

The basic idea of the latent-factor model is to take user-item pairs with their ratings and learn latent dimensions for the rating prediction task. Its feasibility rests on a large quantity of user-item rating data, so that each user and item is observed often enough. To alleviate the cold-start problem and give the model better interpretability, some related works combine the information in the review text with the rating information [1] [2]. In [2], latent rating dimensions (such as those of latent-factor recommender systems) are combined with latent review topics (such as those learned by topic models like LDA). In [1], the authors propose a method that combines content-based filtering seamlessly with collaborative filtering, modeling the reviews and ratings simultaneously.

5.2 Restricted Boltzmann Machine

In [4], Salakhutdinov et al. show how to use Restricted Boltzmann Machines to model tabular data. By adding constraints such as shared weights and disconnected edges, they extend the application of RBMs to user rating prediction. They also derive efficient learning rules and inference procedures for their model so that the performance can be further improved. Finally, they demonstrate that applying RBMs to the Netflix data set can reduce the RMSE by 0.005, and by even more when multiple RBM models and multiple SVD models are linearly combined.

5.3 Text Mining

The basic idea of the text-mining method is to predict product ratings by harnessing the information in the review text. This is especially helpful for new products and users, who may have too few ratings to model their latent factors yet may still provide substantial information through the text of even a single review. The most intuitive approach is an N-gram model with TF-IDF feature extraction, which is the one used in our experiments in the previous sections. This approach is usually adopted as the baseline system against which further improvements are compared. For instance, in Qu et al. [3], the results of the baseline system with the N-gram model are quite similar to our experimental results, which supports our model selection. To improve on it, that paper introduces the concept of Bag-of-Opinions, where an opinion within a review consists of three components: a root word, a set of modifier words from the same sentence, and one or more negation words. Each opinion is assigned a numeric score that is learned by ridge regression. This method overcomes the sparsity problem of the N-gram model and performs better than the naive N-gram model.

6 Experiment Results and Conclusion

6.1 Latent Factor Model

The latent-factor model is easily influenced by the density of the dataset.
If the dataset is too sparse, the rating of a new (user, item) pair cannot be predicted precisely because there is too little information to estimate the bias terms. We therefore first conduct an experiment on the relation between performance and the density of the dataset. In this experiment, we set the dimension of $\gamma$ to 5, with $\lambda_\beta = 4$ and $\lambda_\gamma = 10$ for Video Games, $\lambda_\beta = 6$ and $\lambda_\gamma = 12$ for Health, and $\lambda_\beta = 6$ and $\lambda_\gamma = 12$ for Beauty.

Figure 4: Accuracies vs. minimum numbers of items/users per user/item

The figure shows that the MSE decreases as the minimum number of items/users per user/item grows in each category. In this respect, the latent-factor model does improve with higher density.
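To make the basic model of Section 4.1 concrete, the sketch below minimizes the regularized squared-error loss (2) with plain simultaneous full-batch gradient descent. This is a simplification we assume for illustration: it does not reproduce the closed-form bias updates or the staged schedule described earlier. The defaults `k=5`, `lam_beta=4.0` and `lam_gamma=10.0` follow the Video Games setting above, while `lr`, `epochs`, `seed`, the function names and the toy triples are hypothetical.

```python
import random

def predict(model, u, i):
    """alpha + beta_u + beta_i + gamma_u . gamma_i"""
    dot = sum(a * b for a, b in zip(model["gamma_u"][u], model["gamma_i"][i]))
    return model["alpha"] + model["beta_u"][u] + model["beta_i"][i] + dot

def loss(model, triples, lam_beta, lam_gamma):
    """Sum of squared errors plus the two L2 penalties."""
    sse = sum((predict(model, u, i) - r) ** 2 for u, i, r in triples)
    reg_beta = sum(b * b for b in model["beta_u"].values())
    reg_beta += sum(b * b for b in model["beta_i"].values())
    reg_gamma = sum(x * x for v in model["gamma_u"].values() for x in v)
    reg_gamma += sum(x * x for v in model["gamma_i"].values() for x in v)
    return sse + lam_beta * reg_beta + lam_gamma * reg_gamma

def train(triples, k=5, lam_beta=4.0, lam_gamma=10.0, lr=0.01, epochs=100, seed=0):
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    model = {
        "alpha": sum(r for _, _, r in triples) / len(triples),
        "beta_u": {u: 0.0 for u in users},
        "beta_i": {i: 0.0 for i in items},
        "gamma_u": {u: [rng.gauss(0, 0.1) for _ in range(k)] for u in users},
        "gamma_i": {i: [rng.gauss(0, 0.1) for _ in range(k)] for i in items},
    }
    history = [loss(model, triples, lam_beta, lam_gamma)]
    for _ in range(epochs):
        # Start each gradient with its regularization term (alpha is unpenalized).
        g_alpha = 0.0
        g_bu = {u: 2 * lam_beta * b for u, b in model["beta_u"].items()}
        g_bi = {i: 2 * lam_beta * b for i, b in model["beta_i"].items()}
        g_gu = {u: [2 * lam_gamma * x for x in v] for u, v in model["gamma_u"].items()}
        g_gi = {i: [2 * lam_gamma * x for x in v] for i, v in model["gamma_i"].items()}
        # Accumulate the data term over the full batch.
        for u, i, r in triples:
            err2 = 2 * (predict(model, u, i) - r)
            g_alpha += err2
            g_bu[u] += err2
            g_bi[i] += err2
            for d in range(k):
                g_gu[u][d] += err2 * model["gamma_i"][i][d]
                g_gi[i][d] += err2 * model["gamma_u"][u][d]
        # Simultaneous update of all parameters.
        model["alpha"] -= lr * g_alpha
        for u in users:
            model["beta_u"][u] -= lr * g_bu[u]
            model["gamma_u"][u] = [x - lr * g for x, g in zip(model["gamma_u"][u], g_gu[u])]
        for i in items:
            model["beta_i"][i] -= lr * g_bi[i]
            model["gamma_i"][i] = [x - lr * g for x, g in zip(model["gamma_i"][i], g_gi[i])]
        history.append(loss(model, triples, lam_beta, lam_gamma))
    return model, history

# Toy (user, item, rating) triples (hypothetical).
triples = [("u1", "a", 5), ("u1", "b", 4), ("u2", "a", 4), ("u2", "b", 3),
           ("u3", "a", 2), ("u3", "c", 1), ("u1", "c", 3), ("u2", "c", 2)]
model, history = train(triples)
print(history[0], history[-1], predict(model, "u3", "b"))
```

Because the loss is a sum over all training pairs, the gradient grows with the dataset size, so the step size would need to shrink accordingly on real data.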

We also conduct an experiment to show the difference between the model with and without $\gamma$. Table 3 lists the MSEs of the three categories over different minimum numbers of items/users per user/item.

Table 3: Comparison of MSEs with and without γ

Category     Min #   Without γ        With γ
Video Game       5   1.10226624779    1.10130206940
                 7   1.03449730014    1.03226650941
                 9   0.96446515868    0.96236534211
                11   0.93984131191    0.93604523502
                13   0.89261020996    0.88768704477
Health           5   1.06213995319    1.06202492522
                 7   0.84962904415    0.84894890012
                 9   0.73195039429    0.72994227025
                11   0.72845785356    0.72587126763
                13   0.72133358258    0.71843810841
Beauty           5   1.16701007071    1.16671213344
                 7   0.91563682339    0.91419769846
                 9   0.71263453373    0.71088355486
                11   0.69146409242    0.68969065324
                13   0.69116775963    0.68943366486

Including $\gamma$ does improve the performance, although the gain is relatively trivial. This suggests that there are latent factors underlying the rating data and that they can be expressed by an SVD-like factorization.

Besides the basic model, we also modify it by incorporating category information, so that a model trained on a dataset of mixed categories becomes more category-specific. By mixing the datasets of the three categories and keeping only users and items with at least 9 items/users, we get a new mixed dataset. Applying the basic model and the improved one with $\lambda_\beta = 5$, $\lambda_\gamma = 10$ and $\lambda_\rho = 5$, we get MSEs of 0.82181075125 and 0.82144958746 respectively. The improvement is tiny, but it suggests that incorporating category information is feasible. In addition, since this improvement is far less significant than that of using separate models, we infer that the differences between categories are too large to be covered by $\rho$ alone, so using entirely separate $\alpha$, $\beta$ and $\gamma$ per category is better.

6.2 Restricted Boltzmann Machine

Because our RBM implementation is matrix-based, we cannot apply it to the original dataset. We therefore only conduct experiments on the datasets in which every user and item has at least 7 related items/users.
Here we set the number of hidden units to 100, the number of epochs to 5, the batch size to 500, the learning rate to 0.1, and the momentum to 0.5. The MSEs are 1.1652970920770037, 0.9962176586621301, and 1.0767795450895149 for the categories Video Games, Health, and Beauty respectively. Directly applying RBM without the other optimization methods described in [4] therefore performs very poorly.

6.3 Text Mining

To implement this model, we first compute the TF-IDF weights with the TfidfVectorizer module in sklearn. For the SVR model, we use the LinearSVR module in sklearn with the hyper-parameters C = 1 and ɛ = 0.2. We extract the TF-IDF features from each dataset and fit the SVR model on them. Figure 5 compares our method with the baseline method.

Figure 5: MSE for different Datasets

The figure shows that our method beats the baseline method by almost 40%. For both the baseline and our method, the MSE decreases as the data is compressed. We attribute this to the higher quality of the review text in the compressed datasets. When only the users and items with a large number of reviews remain, the training set is smaller, but the remaining reviews are usually of higher quality, so we can extract richer text information and make more accurate rating predictions. We also notice that the MSE seems to increase slightly when the dataset is compressed too aggressively. A likely explanation is that once the dataset is no longer large enough to provide plenty of text for training, the performance of the text mining model suffers.

To interpret our text model, we extract the words with the highest and lowest weights in the SVR model for each category, which helps explain how the review text affects the ratings.

Figure 6: Positive Words in Review Text. (a) Health & Personal Care (b) Video Games (c) Beauty

Among the positive words, some universal ones such as amazed, best, and great appear in all the categories. There are also words that make sense for a specific category. For instance, in the Health and Personal Care category the positive words include nutritious, delicious, and maintenance. In the Video Games category they include preinstalled, plausible, and holy, and in the Beauty category they include enriching, repurchase, and relaxing.
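Extracting the highest- and lowest-weighted words from a linear model amounts to sorting the vocabulary by weight, as in the minimal sketch below. The function name `extreme_words`, the vocabulary and the weights are hypothetical; with sklearn, the two inputs would typically come from the fitted vectorizer's `get_feature_names_out()` and the fitted LinearSVR's `coef_`.

```python
def extreme_words(vocabulary, weights, n=3):
    """Return the n highest-weight and n lowest-weight words.

    vocabulary[j] is the word whose learned weight is weights[j].
    """
    if len(vocabulary) != len(weights):
        raise ValueError("vocabulary and weights must have the same length")
    ranked = sorted(zip(weights, vocabulary))         # ascending by weight
    negative = [word for _, word in ranked[:n]]       # most negative first
    positive = [word for _, word in ranked[::-1][:n]] # most positive first
    return positive, negative

# Hypothetical vocabulary and weights for illustration only.
vocabulary = ["great", "boring", "best", "worst", "game", "amazed"]
weights = [0.9, -0.7, 1.1, -1.3, 0.05, 0.8]
positive, negative = extreme_words(vocabulary, weights, n=2)
print(positive, negative)
```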

Figure 7: Negative Words in Review Text. (a) Health & Personal Care (b) Video Games (c) Beauty

Among the negative words, some universal ones such as worst, disappointing, and trash appear in all the categories. In the Health category, inconvenient, ineffective, and flimsy are keywords for negative reviews. In the Video Games category, the keywords are boring, uninstall, and unplayable. In the Beauty category, the keywords are crap, return, and disappointed. These keywords differ considerably between categories, so designing a separate text model for each category should give more accurate predictions.

6.4 Model Comparison and Conclusion

Table 4 shows the performance (MSE) of the different models on the different datasets.

Category                  K-Core   Average Baseline   Text Mining   Latent-Factor Model
Health & Personal Care         5             1.2577        0.8458                1.1013
                               7             1.1077        0.7539                1.0322
                               9             1.0402        0.6884                0.9624
                              11             1.0587        0.7074                0.9361
                              13             1.0364        0.7282                0.8876
Video Games                    5             1.4484        0.7887                1.0620
                               7             1.3825        0.7809                0.8489
                               9             1.3347        0.7345                0.7299
                              11             1.2878        0.7479                0.7259
                              13             1.2248        0.7521                0.7184
Beauty                         5             1.3614        0.7928                1.1667
                               7             1.1298        0.7191                0.9142
                               9             0.9712       0.62322                0.7109
                              11             0.9579        0.6018                0.6897
                              13             0.9628        0.6172                0.6894

RBM (7-core datasets only; per-category values are given in Section 6.2): 1.1653, 0.9962, 1.0767

Table 4: Performance Comparison of Different Methods on Datasets

The table shows that text mining is the strongest strategy overall for the rating prediction task when review text is available. It achieves the lowest MSE in most settings, although the latent-factor model matches or slightly beats it on some of the densest datasets. Looking at the trend, the performance of the latent-factor model continues to improve with density while that of text mining starts to decay. This indicates that the latent-factor model can reach performance close to, or even better than, text mining on dense datasets. We therefore conclude that for datasets with rich text information, the text mining method achieves satisfactory prediction accuracy.
For datasets with dense user-item pair information, the latent-factor model performs quite well. RBM is a relatively novel method whose potential remains to be explored and improved in future research.

References

[1] Guang Ling, Michael R. Lyu, and Irwin King. Ratings meet reviews, a combined approach to recommend. In: Proceedings of the 8th ACM Conference on Recommender Systems. ACM. 2014, pp. 105-112.
[2] Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In: Proceedings of the 7th ACM Conference on Recommender Systems. ACM. 2013, pp. 165-172.
[3] Lizhen Qu, Georgiana Ifrim, and Gerhard Weikum. The bag-of-opinions method for review rating prediction from sparse text patterns. In: Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics. 2010, pp. 913-921.
[4] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning. ACM. 2007, pp. 791-798.