Data Mining Techniques
CS 6, Spring 7
Lecture: Jan-Willem van de Meent
(credit: Andrew Ng, Alex Smola, Yehuda Koren, Stanford CS6)
Project
Project Deadlines
Feb: Form teams
7 Feb: Submit abstract (1 paragraph)
Mar: Submit proposals
Mar: Milestone 1 (exploratory analysis)
Mar: Milestone 2 (statistical analysis)
6 Apr (Sun): Submit reports
Apr (Fri): Submit peer reviews
Project Reports
~ pages (rough guideline)
Guidelines for contents:
Introduction / Motivation
Exploratory analysis (if applicable)
Data mining analysis
Discussion of results
Project Review
Reviews are randomly assigned per person
Reviews should discuss these aspects of the report:
Clarity (is the writing clear?)
Technical merit (are the methods valid?)
Reproducibility (is it clear how results were obtained?)
Discussion (are the results interpretable?)
Final Exam
Topic List
http://www.ccs.neu.edu/home/jwvdm/teaching/cs6/spring7/final-topics.html
Emphasis on post-midterm topics (but some pre-midterm topics included)
Recommender Systems
The Long Tail (from: https://www.wired.com///tail/)
Problem Setting Task: Predict user preferences for unseen items
Content-based Filtering
[Figure: movies arranged along two axes, serious vs. escapist and geared towards females vs. geared towards males: The Color Purple, Amadeus, Braveheart, Sense and Sensibility, Ocean's, Lethal Weapon, Dave, The Princess Diaries, The Lion King, Independence Day, Gus, Dumb and Dumber]
Content-based Filtering
Idea: Predict rating using item features on a per-user basis
Content-based Filtering
Idea: Predict rating using user features on a per-item basis
Collaborative Filtering
[Figure: ratings matrix for user "Joe" and similar users]
Idea: Predict rating based on similarity to other users
Problem Setting
Task: Predict user preferences for unseen items
Content-based filtering: model user/item features
Collaborative filtering: implicit similarity of users and items
Recommender Systems
Movie recommendation (Netflix)
Related product recommendation (Amazon)
Web page ranking (Google)
Social recommendation (Facebook)
Priority inbox & spam filtering (Google)
Online dating (OK Cupid)
Computational advertising (everyone)
Challenges
Scalability: millions of objects, 100s of millions of users
Cold start: changing user base, changing inventory
Imbalanced dataset: user activity / item reviews are power-law distributed; ratings are not missing at random
Running Example: Netflix Data
[Table: sample training data (user, movie, date, score) and test data (user, movie, date, score withheld)]
Released as part of a $1M competition by Netflix in 2006
Prize awarded to BellKor in 2009
Running Yardstick: RMSE

$\mathrm{rmse}(S) = \sqrt{\frac{1}{|S|} \sum_{(i,u) \in S} (\hat{r}_{ui} - r_{ui})^2}$

(doesn't tell you how to actually do recommendation)
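As a concrete reference, the RMSE yardstick can be written in a few lines of Python (NumPy assumed; the function name `rmse` is just illustrative):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean square error over aligned arrays of predicted and true ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))
```

Lower is better; the competition scored submissions this way over a held-out test set.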
Content-based Filtering
Item-based Features
Per-user Regression
Learn a set of regression coefficients for each user:

$w_u = \arg\min_w \| r_u - X w \|^2$
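A minimal sketch of this per-user fit via the normal equations, assuming NumPy and an item-feature matrix restricted to the items the user rated (the optional ridge term `reg` is an added stabilizer, not part of the formula above):

```python
import numpy as np

def fit_user_weights(X_rated, r_u, reg=0.0):
    """Solve w_u = argmin_w ||r_u - X w||^2 via the normal equations.

    X_rated: (n_rated, n_features) features of the items this user rated
    r_u:     (n_rated,) this user's ratings
    reg:     optional ridge penalty (illustrative addition for stability)
    """
    d = X_rated.shape[1]
    A = X_rated.T @ X_rated + reg * np.eye(d)
    return np.linalg.solve(A, X_rated.T @ r_u)
```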
User Bias and Item Popularity
Bias
[Figure: example ratings for Moonrise Kingdom]
Problem: some movies are universally loved / hated; some users are more picky than others
Solution: introduce a per-movie and per-user bias
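One simple way to estimate such biases is to take residual means in sequence: first the global mean, then per-user offsets, then per-item offsets. A sketch (NumPy assumed; the function name and the sequential estimation order are illustrative choices, not the only option):

```python
import numpy as np

def fit_biases(ratings):
    """Estimate a global mean plus per-user and per-item biases from
    (user, item, rating) triples, by taking residual means in sequence."""
    mu = np.mean([r for _, _, r in ratings])
    by_user, by_item = {}, {}
    # per-user bias: mean residual after removing the global mean
    for u, _, r in ratings:
        by_user.setdefault(u, []).append(r - mu)
    b_u = {u: float(np.mean(res)) for u, res in by_user.items()}
    # per-item bias: mean residual after also removing the user bias
    for u, i, r in ratings:
        by_item.setdefault(i, []).append(r - mu - b_u[u])
    b_i = {i: float(np.mean(res)) for i, res in by_item.items()}
    return mu, b_u, b_i
```

A baseline prediction is then r̂_ui = μ + b_u + b_i.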
Collaborative Filtering
Neighborhood Based Methods
[Figure: ratings matrix for user "Joe"]
Users and items form a bipartite graph (edges are ratings)
Neighborhood Based Methods
(user, user) similarity: predict rating based on the average from the k-nearest users; good if the item base is small or changes rapidly
(item, item) similarity: predict rating based on the average from the k-nearest items; good if the user base is small or changes rapidly
Parzen-Window Style CF
Define a similarity $s_{ij}$ between items
Find the set $\varepsilon_k(i, u)$ of k-nearest neighbors to i that were rated by user u
Predict the rating using a weighted average over this set:

$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in \varepsilon_k(i,u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in \varepsilon_k(i,u)} s_{ij}}$

How should we define $s_{ij}$?
Pearson Correlation Coefficient
User ratings for items i and j (over users who rated both):

$s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}]\,\mathrm{Std}[r_{uj}]}$
(item, item) similarity
Empirical estimate of the Pearson correlation coefficient:

$\hat{\rho}_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}$

Regularize towards 0 for small support:

$s_{ij} = \frac{|U(i,j)|}{|U(i,j)| + \lambda} \hat{\rho}_{ij}$

(likewise, regularize predictions towards the baseline for small neighborhoods)
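The shrunk estimator translates directly into code; a sketch assuming the baseline-removed residuals for the users in U(i, j) are already aligned in two arrays (`lam` plays the role of λ, and its default here is an arbitrary illustrative value):

```python
import numpy as np

def shrunk_pearson(res_i, res_j, lam=100.0):
    """Correlation of two items' baseline-removed ratings, restricted to
    users who rated both, shrunk towards 0 when that support is small."""
    n = len(res_i)                         # |U(i, j)|, the shared support
    num = np.dot(res_i, res_j)
    den = np.sqrt(np.dot(res_i, res_i) * np.dot(res_j, res_j))
    rho = num / den if den > 0 else 0.0
    return (n / (n + lam)) * rho
```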
Similarity for binary labels
Pearson correlation is not meaningful for binary labels (e.g. views, purchases, clicks)
Jaccard similarity:

$s_{ij} = \frac{m_{ij}}{m_i + m_j - m_{ij}}$

Observed / expected ratio:

$s_{ij} = \frac{\text{observed}}{\text{expected}} = \frac{m_{ij}}{m_i m_j / m}$

where $m_i$ = users acting on i, $m_{ij}$ = users acting on both i and j, $m$ = total number of users
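Both binary-label similarities are one-liners over the counts m_i, m_j, m_ij, and m (function names are illustrative):

```python
def jaccard_similarity(m_i, m_j, m_ij):
    """Jaccard: users acting on both items over users acting on either."""
    return m_ij / (m_i + m_j - m_ij)

def observed_expected(m_i, m_j, m_ij, m):
    """Co-occurrence count relative to what independent behavior would predict."""
    return m_ij / (m_i * m_j / m)
```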
Matrix Factorization Methods
Matrix Factorization
[Figure: example ratings for Moonrise Kingdom]
Idea: pose as a (biased) matrix factorization problem
Matrix Factorization
[Figure: a low-rank SVD approximation: the (users × items) ratings matrix is approximated by the product of a (users × factors) matrix and a (factors × items) matrix]

Prediction
[Figure: a missing rating is predicted as the inner product of the corresponding user-factor and item-factor rows]
SVD with missing values
Pose as a regression problem over the observed entries; regularize using the Frobenius norm:

$\min_{W,X} \sum_{(u,i)\,\text{observed}} (r_{ui} - w_u^\top x_i)^2 + \lambda \left( \|W\|_F^2 + \|X\|_F^2 \right)$
Alternating Least Squares
Alternate two ridge-regression steps:
regress $w_u$ given X (for each user u)
regress $x_i$ given W (for each item i)
Each step has a closed-form solution: $w = (X^\top X + \lambda I)^{-1} X^\top y$
Remember ridge regression?
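A compact ALS sketch under simplifying assumptions (NumPy; dense matrices with a boolean mask of observed entries; no bias terms; k, λ, and the iteration count are arbitrary illustrative defaults):

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, n_iters=50, seed=0):
    """Alternating least squares for R ~ W @ X.T on observed entries only.

    R:    (n_users, n_items) ratings (entries outside `mask` are ignored)
    mask: boolean (n_users, n_items), True where a rating was observed
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    W = rng.normal(scale=0.1, size=(n_users, k))
    X = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(n_iters):
        # regress each w_u given X: a ridge problem with a closed-form solution
        for u in range(n_users):
            obs = mask[u]
            A = X[obs].T @ X[obs] + lam * np.eye(k)
            W[u] = np.linalg.solve(A, X[obs].T @ R[u, obs])
        # regress each x_i given W, symmetrically
        for i in range(n_items):
            obs = mask[:, i]
            A = W[obs].T @ W[obs] + lam * np.eye(k)
            X[i] = np.linalg.solve(A, W[obs].T @ R[obs, i])
    return W, X
```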
Stochastic Gradient Descent
Update the factors one observed rating at a time
No need for locking: multiple cores update asynchronously (Recht, Re, Wright: Hogwild!)
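The corresponding single-threaded SGD inner loop, before any Hogwild-style parallelism, might look like this (NumPy assumed; learning rate, regularization, and epoch count are illustrative):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=2, lr=0.05, lam=0.0, n_epochs=1000, seed=0):
    """SGD for matrix factorization: for each observed rating, step both
    factor rows along the gradient of the squared prediction error.

    ratings: list of (user_index, item_index, rating) triples
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_users, k))
    X = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - W[u] @ X[i]   # prediction error on this rating
            w_old = W[u].copy()     # use the pre-update value for both steps
            W[u] += lr * (err * X[i] - lam * W[u])
            X[i] += lr * (err * w_old - lam * X[i])
    return W, X
```

Hogwild! runs essentially this loop from many threads without locks; since each rating touches only one row of W and one row of X, collisions are rare when the data is sparse.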
Sampling Bias
Ratings are not given at random
[Figure: distributions of Netflix ratings, Yahoo! music ratings, and Yahoo! survey answers]
Ratings are not given at random
[Figure: the observed ratings r_ui and the indicator c_ui of which (user, movie) pairs were rated; matrix factorization models r_ui, regression models c_ui]
Temporal Effects
Changes in user behavior
[Figure: mean rating over time shifts when Netflix changed its rating labels]
Movies get better with time?
Temporal Effects
Solution: model temporal effects in the bias terms, not the weights
Netflix Prize
Netflix Prize
Training data: 100 million ratings; 480,000 users; 17,770 movies; 6 years of data (2000-2005)
Test data: last few ratings of each user (2.8 million)
Evaluation criterion: Root Mean Square Error (RMSE)
Competition: 2,700+ teams
Netflix's system RMSE: 0.9514
$1 million prize for a 10% improvement on Netflix's system
Improvements
[Figure: factor models, error (RMSE) vs. number of parameters (millions), for NMF, BiasSVD, SVD++, and three SVD variants]
Add biases: do SGD, but also learn the biases μ, b_u, and b_i
Who rated what: account for the fact that ratings are not missing at random
Temporal effects
Still pretty far from the 0.8563 grand prize target
Winning Solution from BellKor
Last 30 days
A June 26th submission (the first to pass the 10% threshold) triggers the 30-day last call
BellKor fends off competitors by a hair
Ratings aren't everything
[Figure: Netflix then vs. Netflix now]
Only the simpler submodels (SVD, RBMs) were implemented
Ratings eventually proved to be only weakly informative