CS294-1 Assignment 2 Report


Keling Chen and Huasha Zhao
February 24, 2012

1 Introduction

The goal of this homework is to predict a user's numeric rating for a book from the text of the user's review. The original dataset contains 975,194 Amazon book reviews. For each experiment, we perform 10-fold cross validation to estimate prediction error.

2 Problem

Assume there are altogether $K$ distinct users and $I$ unique products reviewed in our dataset, labeled $u_1, u_2, \ldots, u_K$ and $p_1, p_2, \ldots, p_I$ respectively. For an individual user $u_k$, denote the number of reviews he or she wrote as $n_k$. Further assume that there are in total $N$ reviews across all users and products in the corpus, where $N = \sum_{k=1}^{K} n_k$. The review documents are labeled $D_1, D_2, \ldots, D_N$, and we arrange them so that reviews from the same user stay together, in increasing order of user label.

The vocabulary set (dictionary) of all the documents is denoted $\mathcal{V}$, and its size is $V = |\mathcal{V}|$. Each document is associated with a review score and a bag-of-words feature vector. The feature vector $f_n = (f_n^1, f_n^2, \ldots, f_n^V)$ records the number of appearances of each word $v \in \mathcal{V}$ in document $D_n$; we expect it to be a sparse vector with non-negative integer entries. Further let $s = (s_1, s_2, \ldots, s_N)^T$ denote the score vector, so that the $n$-th entry $s_n$ is the review score of document $D_n$. Similarly, the feature vectors of the documents are stacked to form the document matrix $X = (f_1; f_2; \ldots; f_N)$. Since each feature vector has the same dimension as the vocabulary set, $X$ is an $N \times V$ matrix.

We want to predict review scores from the document feature vectors. We propose a linear prediction model that assigns a weight to each word, with the final scores adjusted by user biases. Concretely, the predicted scores $\hat{s}$ are calculated by

$$\hat{s} = Xw \tag{1}$$

where $w$ is a $V$-dimensional weight vector, each entry of which denotes how strongly the corresponding word determines the score.
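To make the notation concrete, the following is a minimal sketch of building the count matrix $X$ and evaluating equation (1). The tokenization (lowercasing and whitespace splitting), the helper names `build_vocabulary` and `bag_of_words`, the toy documents, and the weights are all assumptions made for illustration; they are not taken from the report. A dense array is used only for readability, and a corpus of this size would presumably need a sparse representation.

```python
from collections import Counter

import numpy as np


def build_vocabulary(docs):
    """Map each distinct word to a column index (sorted for reproducibility)."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: j for j, w in enumerate(words)}


def bag_of_words(docs, vocab):
    """Return the N x V count matrix X; row n is the feature vector f_n."""
    X = np.zeros((len(docs), len(vocab)))
    for n, doc in enumerate(docs):
        for word, count in Counter(doc.lower().split()).items():
            if word in vocab:
                X[n, vocab[word]] = count
    return X


# Toy corpus (hypothetical): two very short "reviews".
docs = ["excellent excellent book", "useless book"]
vocab = build_vocabulary(docs)      # columns: book, excellent, useless
X = bag_of_words(docs, vocab)

# Made-up weights, one per vocabulary word, for illustration only.
w = np.array([0.0, 1.0, -2.0])
s_hat = X @ w                       # equation (1): predicted scores
```

Each row of `X` counts word occurrences in one document, so `s_hat` is simply the count-weighted sum of word weights for each review.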

The problem now is to estimate the parameter $w$ from the dataset. Since most words contribute little to the final score, we expect $w$ to have a sparse structure. It is natural to consider the optimization problem of minimizing the $\ell_2$ norm of the error, that is,

$$\min_{w,c} \|\hat{s} - s\|_2 \tag{2}$$

where $c$ is the vector of user biases defined in Section 3.1.1.

3 Methods

3.1 Part 1

We try both Ridge and Lasso regularization to solve the above optimization problem. The exact solution is computed for Ridge, and stochastic gradient descent is used to approximate Lasso, each with several values of the penalty factor λ. These serve as the baseline algorithms in this report. To further improve prediction accuracy, we also add the following two features.

3.1.1 Reviewer Preference

Some reviewers tend to give higher scores than others given the same attitude towards the product. One more feature is added to each review to characterize this reviewer bias. Specifically, equation (1) is modified to

$$\hat{s} = Xw + c \tag{3}$$

where the vector $c$ has the same size as $\hat{s}$ and represents customer biases. Given the order in which we arrange the review documents, we should have $c_1 = c_2 = \cdots = c_{n_1}$, $c_{n_1+1} = c_{n_1+2} = \cdots = c_{n_1+n_2}$, and so on, because scores rated by the same user share the same bias.

3.1.2 Rating Drift

Rating scores are discrete in nature, and it is hard to say that a 4-star rating places the reviewer's attitude exactly midway between 3 stars and 5 stars. Given that 1 star and 5 stars represent the two extremes of a reviewer's attitude towards a product, we introduce adjustments for reviews with 2 or 4 stars. The star drift is another private parameter for each individual reviewer.

3.2 Part 2

Because duplicated reviews might affect prediction accuracy, we hash the first 15 words of each review to identify unique reviews and remove the duplicates. This leaves around 500,000 unique reviews. Low-frequency feature words and some stop words are removed to reduce background noise.
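The deduplication step can be sketched as follows. The report states only that the first 15 words are hashed; the normalization (lowercasing, whitespace splitting), the choice of MD5 as the hash, and the function names `review_key` and `remove_duplicates` are assumptions made for this illustration.

```python
import hashlib


def review_key(text, n_words=15):
    """Hash the first n_words words of a review (lowercased; an assumption)."""
    prefix = " ".join(text.lower().split()[:n_words])
    return hashlib.md5(prefix.encode("utf-8")).hexdigest()


def remove_duplicates(reviews, n_words=15):
    """Keep the first review seen for each distinct prefix hash."""
    seen = set()
    unique = []
    for text in reviews:
        key = review_key(text, n_words)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

Two reviews that share their first 15 words are treated as the same review even if they differ later, which is the trade-off implied by hashing only a prefix.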
Finally we obtain a predictor matrix $X$ with 502,460 samples and 10,999 features, and a response vector $Y$ with 502,460 rows. The dataset is shuffled before 10-fold cross validation to reduce any bias caused by the original ordering of the data.

We use the squared-error ($\ell_2$) loss shown in (4). To avoid overfitting, we add an $\ell_2$-norm regularizer to the loss function with parameter $\lambda > 0$:

$$\beta^{\text{ridge}} = \arg\min_{\beta} \, (Y - X\beta - \beta_0)^T (Y - X\beta - \beta_0) + \lambda \|\beta\|_2^2 \tag{4}$$

We solve this optimization problem with the stochastic gradient descent algorithm below. The procedure starts with $\beta = 0$, $\beta \in \mathbb{R}^p$, and then iteratively updates every coordinate of the vector until convergence. At each iteration $t$, randomly choose a block of training data $X_b, Y_b$:

1) Compute $G_t = \nabla f(\beta_t) = -X_b^T (Y_b - X_b \beta_t) + \lambda \beta_t$, the gradient of the block objective up to a constant factor that is absorbed into the step size;
2) Update $\beta_{t+1} = \beta_t - \alpha_t G_t$, where the step size $\alpha_t$ is optimized at each iteration $t$ by solving $\min_{\alpha_t} \, (Y_b - X_b(\beta_t - \alpha_t G_t))^T (Y_b - X_b(\beta_t - \alpha_t G_t)) + \lambda \|\beta_t - \alpha_t G_t\|_2^2$;
3) Use the remaining training data to compute the root MSE as the criterion.

4 Results

4.1 Part 1

4.1.1 Model Comparison

Four models are compared in this section: the baseline Ridge model (unigram), the baseline Lasso model, baseline + reviewer preference, and baseline + preference + star drift. We use 10-fold cross validation and root MSE (RMSE) as the performance measure. Each model is tested with $\lambda \in \{0.1, 0.2, 0.5, 1.0, 2.0\}$. The best-performing λ is chosen for each model, and the corresponding RMSE of the four models is plotted in Figure 1. The $\ell_2$ penalty outperforms $\ell_1$, and the last model beats the others significantly.
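The selection of λ by 10-fold cross validation can be sketched as below for the plain Ridge baseline. This is an illustration under assumptions, not the authors' code: it uses the standard closed-form ridge solution without an intercept, reviewer preference, or star drift terms, a dense matrix, and hypothetical function names (`ridge_fit`, `cross_val_rmse`, `select_lambda`).

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Exact ridge solution w = (X^T X + lam I)^{-1} X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def cross_val_rmse(X, y, lam, n_folds=10, seed=0):
    """Shuffle the data, split into folds, and return the per-fold test RMSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w = ridge_fit(X[train], y[train], lam)
        scores.append(rmse(y[test], X[test] @ w))
    return scores


def select_lambda(X, y, lambdas=(0.1, 0.2, 0.5, 1.0, 2.0)):
    """Return the lambda with the lowest mean cross-validated RMSE."""
    mean_rmse = {lam: float(np.mean(cross_val_rmse(X, y, lam))) for lam in lambdas}
    best = min(mean_rmse, key=mean_rmse.get)
    return best, mean_rmse
```

The same fold assignment (fixed `seed`) is reused for every λ so that the candidates are compared on identical splits.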

Figure 1: Model Comparisons (comparison of best cross-validation performance, RMSE). X-axis is the 10-fold cross validation fold (1 to 10), Y-axis is RMSE. Series: baseline l1; baseline l2; include individual rating preference; include individual rating preference and star drift.

4.1.2 ROC and Lift Score

We further compare the performance of the full model and the model without star drift adjustment using ROC and lift score measures. Even though the full model outperforms the others in the RMSE measure, it does not show a significant improvement in classifying sentiment polarity. This is illustrated in Figures 2 and 3.

4.1.3 Strongest Terms

We use unigrams throughout the experiment. The strongest positive and negative terms are listed in Table 1.

Figure 2: ROC plot. X-axis is 1 − specificity, Y-axis is sensitivity. Series: Full Model; Without Star Drift.

Figure 3: Lift plot. X-axis is 1 − specificity, Y-axis is lift value. Series: Full Model; Without Star Drift.

Table 1: Terms with Strongest Weights

Positive: Fascinating, Excellent, unsettling, Highly, adorable, Laden, punches, Brilliant, Wonderful, Loved
Negative: poorly, Save, waste, disappointment, useless, disjointed, disappointing, unreadable, Sorry, drivel

4.2 Part 2

The data is split into 10 roughly equal-sized folds ( documents each), so that the estimated prediction error (RMSE) is the average over the trials of 10-fold cross validation. We used word unigrams as the features. Figure 4 shows an example of one training run: the RMSE decreases from 1.5176 to 0.9986 after 800 iterations.

Figure 4: Stochastic gradient descent training results

The RMSE on the testing data for each fold of the 10-fold cross validation is shown below. The average RMSE is 1.0081.

Figure 5: RMSE as prediction error for testing data by stochastic gradient descent

5 Discussion

Based on our results, we conclude that the model including individual rating preference and star drift is better suited to capturing features and minimizing prediction error. Although we focus on unigrams to model reviews during stochastic gradient descent, we believe the same framework can benefit from modeling n-grams. We also tried different values of λ and different block sizes of training samples. It turns out that if the block size is too small, the descent results are poor, probably because the randomness across small blocks leads to large fluctuations in the gradient.
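The block update of Section 3.2 and the effect of block size can be explored with a sketch like the following. It rests on several assumptions that are not in the report: the intercept $\beta_0$ is omitted, the gradient is taken as $G = -X_b^T(Y_b - X_b\beta) + \lambda\beta$, and the step size uses the closed form $\alpha = G^T G / (\|X_b G\|^2 + \lambda \|G\|^2)$, which is what exact minimization of the block objective along $-G$ gives for this quadratic. The function name `sgd_ridge` and its parameters are hypothetical.

```python
import numpy as np


def sgd_ridge(X, y, lam, block_size, n_iters, seed=0):
    """Block stochastic gradient descent for ridge with exact line search."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)                  # start from beta = 0
    for _ in range(n_iters):
        # Randomly choose a block of training data (X_b, Y_b).
        idx = rng.choice(n_samples, size=block_size, replace=False)
        Xb, yb = X[idx], y[idx]
        residual = yb - Xb @ beta
        G = -Xb.T @ residual + lam * beta        # gradient up to a factor of 2
        XG = Xb @ G
        denom = XG @ XG + lam * (G @ G)
        if denom == 0.0:                         # zero gradient: nothing to update
            break
        alpha = (G @ G) / denom                  # exact minimizer along -G
        beta = beta - alpha * G
    return beta


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Running `sgd_ridge` with several values of `block_size` on the same data and comparing held-out `rmse` is one way to observe the fluctuation described above for small blocks.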