By Atul S. Kulkarni Graduate Student, University of Minnesota Duluth. Under The Guidance of Dr. Richard Maclin


1 By Atul S. Kulkarni Graduate Student, University of Minnesota Duluth Under The Guidance of Dr. Richard Maclin

2 Outline Problem Statement Background Proposed Solution Experiments & Results Related Work Future Work Conclusion Q & A

3

4 Problem Statement Given a set of users with their previous ratings for a set of movies, can we predict the rating they will assign to a movie they have not previously rated? Netflix puts it as: "The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love." So what do they want? A 10% improvement over their existing system. They are paying $1 Million for it.

5 Problem Statement Similarly, which movie will you like given that you have seen X-Men, X-Men II, and X-Men: The Last Stand, and users who saw these movies also liked X-Men Origins: Wolverine? Answer: ?

6 Dataset Background for the problem Background for the Solution

7 Background - Dataset Netflix Prize Dataset Netflix released data for this competition Contains nearly 100 Million ratings Number of users (anonymized) = 480,189 Number of movies rated by them = 17,770 Training data is provided per movie To verify the developed model without submitting predictions to Netflix, probe.txt is provided To submit predictions for the competition, qualifying.txt is used

8 Background - Dataset Data in the training file is organized per movie. Each movie's block looks like this:
Movie#:
Customer#,Rating,Date of Rating
Customer#,Rating,Date of Rating
Customer#,Rating,Date of Rating
Example: the block for movie 4 lists three customers, with ratings of 3, 1, and 5.
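A minimal parsing sketch for this layout (Python; the function name and file path are hypothetical, and this is not the code used in the project):

```python
from collections import defaultdict

def load_training_file(path):
    """Parse a Netflix-style training file: a 'MovieID:' header line
    followed by 'CustomerID,Rating,Date' lines for that movie."""
    ratings = defaultdict(dict)          # ratings[movie_id][customer_id] = rating
    movie_id = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.endswith(":"):       # movie header line, e.g. "4:"
                movie_id = int(line[:-1])
            else:                        # rating line, e.g. "123,3,2005-09-06"
                customer_id, rating, _date = line.split(",")
                ratings[movie_id][int(customer_id)] = int(rating)
    return ratings
```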

9 Background - Dataset Data points in probe.txt look like this (we have the answers in the training data):
Movie#:
Customer#
Customer#
Data in qualifying.txt looks like this (no answers):
Movie#:
Customer#,Date of Rating
Customer#,Date of Rating

10 Background - Dataset stats Total ratings possible = 480,189 (users) * 17,770 (movies) ≈ 8.5 Billion Total available = 100 Million The Users x Movies matrix therefore has about 8.4 Billion entries missing: Sparse Data
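Spelling out the arithmetic behind these figures:

$$
480{,}189 \times 17{,}770 \approx 8.53 \times 10^{9}, \qquad
8.53 \times 10^{9} - 10^{8} \approx 8.43 \times 10^{9} \ \text{missing}, \qquad
\frac{10^{8}}{8.53 \times 10^{9}} \approx 1.2\% \ \text{filled}.
$$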

11 Background of the problem Recommender Systems Examples: Yahoo, Google, YouTube, Amazon. They recommend items that you might like. The recommendation is made based on past behavior. Collaborative Filtering [Gábor, 2009] What is it? Who collaborates and what is filtered? How can it be applied in this contest?

12 Background of the problem Earlier systems were implemented in the 1990s. GroupLens (Usenet articles) [Resnick, 1997] Siteseer (cross-linking technical papers) [Resnick, 1997] Tapestry (email filtering) [Goldberg, 1992] These earlier solutions required users to rate the items. Two major divisions of methods: Model based - fit a model to the training data. Memory based - Nearest Neighbor methods.

13 Background for the Solution K-Nearest Neighbor (K-NN) method. A memory-based method. Measures the distance between the query instance and every instance in the training set. Finds the K training instances with the least distance from the query instance. The prediction is the average of the ratings these K instances gave to the movie. Distances can be measured using the following formulae.

14 Background for the Solution Distance formulae.
Manhattan Distance: $d(x_i, x_j) = \sum_{f=1}^{F} \lvert x_{i,f} - x_{j,f} \rvert$
Euclidean Distance: $d(x_i, x_j) = \sqrt{\sum_{f=1}^{F} (x_{i,f} - x_{j,f})^2}$
Minkowski Distance: $d(x_i, x_j) = \left( \sum_{f=1}^{F} \lvert x_{i,f} - x_{j,f} \rvert^{p} \right)^{1/p}$
Mahalanobis Distance: $d(x_i, x_j) = \sqrt{(x_i - x_j)^{T} \Sigma^{-1} (x_i - x_j)}$
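As an illustration of these formulae, a small NumPy sketch (not part of the presentation's own code; the covariance matrix for the Mahalanobis case is passed in by the caller):

```python
import numpy as np

def manhattan(xi, xj):
    """Sum of absolute feature differences."""
    return np.sum(np.abs(xi - xj))

def euclidean(xi, xj):
    """Square root of the sum of squared feature differences."""
    return np.sqrt(np.sum((xi - xj) ** 2))

def minkowski(xi, xj, p=3):
    """General L_p distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def mahalanobis(xi, xj, cov):
    """Distance that accounts for feature scales and correlations
    via the inverse covariance matrix of the data."""
    diff = xi - xj
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```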

15 Background for the Solution How important is the distance measure? Curse of Dimensionality. Example: what if we were to characterize a movie by its actors, directors, writers, genre, and then its entire CREW? What is the problem? What if some attributes are more dominant than others? Example: home prices are much larger quantities than a person's height.

16 Background of the Solution What if I was very conservative about my ratings and someone else was too generous? I rate the movie I like the most as 3 and the least as 1; someone else rates his or her highest at 5 and lowest at 3. So am I like this person? Difficult to say. We are comparing two people with very strong personal biases, which results in an obviously flawed similarity measure. Solution? Normalization of the data.

17 Background for the Solution Normalization What is that? How do we do it? How will it change my ratings? Won't I lose the original rating? We calculate the mean rating for every user over the movies he or she has rated, and also the standard deviation of that user's ratings. From every rating we subtract the user's mean rating and divide by their standard deviation.
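A minimal sketch of this per-user z-score normalization, assuming each user's ratings are held in a dict mapping movie to rating (illustrative, not the project code):

```python
import statistics

def normalize_user_ratings(user_ratings):
    """Z-score normalize one user's ratings: subtract the user's mean
    and divide by the user's standard deviation."""
    values = list(user_ratings.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0   # guard against a zero standard deviation
    normalized = {m: (r - mean) / std for m, r in user_ratings.items()}
    return normalized, mean, std

def denormalize(prediction, mean, std):
    """Map a normalized prediction back to the user's original rating scale."""
    return prediction * std + mean
```

Keeping the per-user mean and standard deviation around is what later lets a normalized prediction be mapped back to the user's own 1-5 scale.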

18 Background for the Solution Should all members of the neighborhood contribute equally to the prediction? Not always; we can argue that people who are similar to you, i.e. have the least distance from you, should contribute more than those farther away. This is done by weighting each neighbor's contribution by its distance from the query instance.
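The slide does not fix a particular weighting formula; inverse distance is one common choice, sketched here for illustration:

```python
def weighted_prediction(neighbors):
    """neighbors: list of (distance, rating) pairs for the K nearest users.
    Closer neighbors get larger weights via w = 1 / (distance + eps)."""
    eps = 1e-6                      # avoid division by zero for identical users
    weights = [1.0 / (d + eps) for d, _ in neighbors]
    total = sum(weights)
    return sum(w * r for w, (_, r) in zip(weights, neighbors)) / total
```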

19 Background for the Solution Clustering The idea is to group items together based on their attributes. Data is typically unlabeled. Similarity is measured using the distance between two points. Example: consider going into a comic book shop and grouping together the comics from a pile that are similar. Types: Partitional clustering: K-Means Hierarchical clustering: Agglomerative clustering

20 Background for the Solution K-Means clustering [MacQueen, 1967] Randomly select K instances as cluster centers. Label every data point with its nearest cluster center. Re-compute the cluster centers. Repeat the last two steps until no instances change clusters or a fixed number of iterations has passed. How is it related to our discussion today?
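A compact NumPy sketch of exactly this loop (illustrative only):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """X: (n_samples, n_features) array. Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # no instance changed clusters
        labels = new_labels
        # recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```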

21 K Nearest Neighbor Algorithm Clustering Based Nearest Neighbor Algorithm

22 Proposed Solution K-Nearest Neighbor approach (Overview) Given a query instance q(MovieId, UserId), normalize the data before processing. Find the distance of this instance from all the users who rated this movie. Of these users, select the K users that are nearest to the query instance as its neighborhood. Average the ratings given to this particular movie by the users in this neighborhood. This average is the predicted rating for the query instance.
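Putting the steps together, a rough sketch of this standard K-NN predictor. It assumes the normalized ratings and per-user (mean, std) values from the normalization sketch earlier, and computes the Euclidean distance over the movies both users rated, which is one plausible way to handle missing entries; it is not the project's C/C++ implementation:

```python
import math

def knn_predict(query_user, movie, normalized, stats, k=20):
    """Standard K-NN prediction sketch.
    normalized: user -> {movie: z-scored rating}
    stats: user -> (mean, std) saved from the normalization step."""
    q_vec = normalized[query_user]
    candidates = []
    for user, ratings in normalized.items():
        if user == query_user or movie not in ratings:
            continue                                  # only users who rated this movie
        common = set(q_vec) & set(ratings)            # movies both users rated
        if not common:
            continue
        dist = math.sqrt(sum((q_vec[m] - ratings[m]) ** 2 for m in common))
        candidates.append((dist, ratings[movie]))
    mean, std = stats[query_user]
    if not candidates:
        return mean                                   # fall back to the user's own mean
    neighbors = sorted(candidates)[:k]                # the K nearest users
    norm_pred = sum(r for _, r in neighbors) / len(neighbors)
    return norm_pred * std + mean                     # back to the user's rating scale
```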

23 Proposed Solution - Example Example (representative data, not real): a Users x Movies rating matrix for Jim, Sean, John, Sidd, Penny, and Pete over the movies Matrix, Star Wars, Dark Knight, Rocky, Sita Aur Gita, Star Trek, Cliffhanger, A.I., MI, and X-Men. Pete has rated a few of these movies (a 5 and two 4s), and his rating for Sita Aur Gita is the unknown '?' we want to predict.

24 Proposed Solution - Example Calculate the mean and standard deviation vectors: the meanRating and standardDeviation of each user's ratings, for Jim, Sean, John, Sidd, Penny, and Pete.

25 Proposed Solution - Example Normalized data: the same Users x Movies matrix after each rating has had the user's mean subtracted and been divided by the user's standard deviation (Pete's known ratings become values such as 1.15, and his rating for Sita Aur Gita is still the unknown '?').

26 Proposed Solution - Example So now we have a query instance q(Pete, Sita Aur Gita), i.e. we wish to evaluate how much Pete will like the movie Sita Aur Gita on a scale of 1-5. To do this we need to identify Pete's two neighbors who rated this movie (the 2-NN case). The users who rated the movie Sita Aur Gita are the candidate users: Jim, Sidd, and Penny.

27 Proposed Solution - Example The candidate users with their distances from Pete are Jim, Sidd, and Penny. The two nearest neighbors are Jim and Sidd.

28 Proposed Solution - Example We take the average of the normalized ratings that Jim and Sidd gave to the movie Sita Aur Gita. So is our prediction done? Not yet. This prediction is in normalized form, and we need to bring it back to Pete's rating scale. How? Multiply it by the standard deviation of Pete's ratings, then add Pete's mean rating to the product: predicted rating = (normalized average * Pete's standard deviation) + Pete's mean rating. The result is our predicted rating for Pete.

29 Proposed Solution C-K-NN Clustering based Nearest Neighbor approach Obtain every movie's genre from external sources (IMDB in our case). Create for every user a vector with one cell per genre; in that cell we count the number of movies the user has rated in that genre. (We have one such vector for each user.) Cluster the users according to the genres of the movies they have rated. The cluster centers of these clusters represent the collective opinion of the users in the cluster about movies of that genre. We call them Super Users.
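A sketch of building these per-user genre-count vectors, assuming a movie_genres lookup (movie to list of genres, e.g. obtained via IMDB); the names are illustrative:

```python
GENRES = ["Action", "Adventure", "Crime", "Drama",
          "Fantasy", "Sci-Fi", "Sport", "Thriller"]

def user_genre_vector(user_ratings, movie_genres):
    """user_ratings: {movie: rating} for one user.
    movie_genres: {movie: [genre, ...]} from an external source such as IMDB.
    Returns a count of rated movies per genre, in GENRES order."""
    counts = dict.fromkeys(GENRES, 0)
    for movie in user_ratings:
        for genre in movie_genres.get(movie, []):
            if genre in counts:
                counts[genre] += 1
    return [counts[g] for g in GENRES]
```

These vectors are what the K-Means step sketched earlier would cluster to obtain the super users.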

30 Proposed Solution C-K-NN For each super user we predict a rating for every movie of that genre as the average of the ratings of the users in the cluster who rated the movie. When presented with a query point q(MovieId, UserId), we find all the genres of that movie. For each genre we calculate the distance of the user from the cluster centers for that genre. We select the nearest K cluster centers and average their ratings for the movie to get the predicted rating for this genre. We then average the per-genre predictions to get the predicted rating for q.
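A rough sketch of this prediction step. It assumes super_users maps each genre to a list of (cluster center vector, {movie: average rating}) pairs built from the clusters, reuses Euclidean distance, and simply skips a genre when no super user of that genre has a rating for the movie; that is a simplification of the handling in the worked example, and it is not the Perl/Matlab pipeline used in the experiments:

```python
import math

def cknn_predict(user_vec, movie, movie_genres, super_users, k=1):
    """user_vec: the query user's genre-count vector.
    super_users: {genre: [(center_vector, {movie: rating}), ...]}."""
    per_genre = []
    for genre in movie_genres.get(movie, []):
        # distance of the user from every cluster center of this genre
        scored = []
        for center, ratings in super_users.get(genre, []):
            if movie not in ratings:
                continue
            dist = math.dist(user_vec, center)
            scored.append((dist, ratings[movie]))
        if not scored:
            continue                      # no super user of this genre rated the movie
        nearest = sorted(scored)[:k]      # K nearest super users
        per_genre.append(sum(r for _, r in nearest) / len(nearest))
    return sum(per_genre) / len(per_genre) if per_genre else None
```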

31 Proposed Solution Example (C-K-NN) We use the data from our previous example (recap): the same Users x Movies rating matrix, with Pete's rating for Sita Aur Gita still the unknown '?'.

32 Proposed Solution Example (C-K-NN) We find the genres of every movie, giving a Movies x Genres indicator table over Action, Adventure, Crime, Drama, Fantasy, Sci-Fi, Sport, and Thriller (for example, Rocky, Sita Aur Gita, and Star Trek each fall under two genres and X-Men under three).

33 Proposed Solution Example (C-K-NN) Convert the User x Movie data into a User x Genre table: for each of Jim, Sean, John, Sidd, Penny, and Pete we count the rated movies per genre (Action, Adventure, Crime, Drama, Fantasy, Sci-Fi, Sport, Thriller).

34 Proposed Solution Example (C-K-NN) We cluster the users into two clusters based on these genre-count vectors.

35 Proposed Solution Example (C-K-NN) The query point, as last time, is q(Pete, Sita Aur Gita). For each of Sita Aur Gita's genres (Adventure and Drama), the per-genre clusters are shown with the two cluster centers' ratings for the movies of that genre (Matrix, Star Wars, Sita Aur Gita, Star Trek, Cliffhanger, A.I., and MI appear among them).

36 Proposed Solution Example (C-K-NN) We compute Pete's distance from the two cluster centers of the Adventure genre. Pete's distance from the cluster centers of Drama is not applicable, as Pete has not rated any movie from that genre. We look for the single (K=1) nearest cluster for the Adventure genre: that is cluster two.

37 Proposed Solution Example (C-K-NN) Hence, the rating for the query point q(Pete, Sita Aur Gita) is calculated by taking the rating that cluster two of the Adventure genre gives this movie. Our prediction is 2 for this movie. What if Pete had rated one of the movies from the Drama genre? We would predict a rating for the Drama genre as well, then average the two per-genre predictions to get the final rating.

38

39 Experiments Setup Dataset used: Netflix Prize Dataset. Experiments were performed on 1121 randomly selected movies, covering a corresponding set of users. These data instances were chosen from the probe file of the Netflix dataset, so we have the ratings for these instances in the training data. These instances are treated as the hold-out set in the experiments.

40 Experiments Setup We normalize the data for the K-NN method; the predictions are then converted back to the denormalized form. We test the same set of (movie, user) pairs on both methods: Standard K-Nearest Neighbor Clustered K-Nearest Neighbor

41 Experiments - Setup This is a regression problem, so when we are off from the expected value we want to know by how much. Hence, the test metrics used are Root Mean Square Error (RMSE): $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(p_i - a_i)^2}$, Absolute Average Error (AAE): $\frac{1}{N}\sum_{i=1}^{N}\lvert p_i - a_i \rvert$ (with $p_i$ the predicted and $a_i$ the actual rating), and the time taken.
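Both metrics are simple to compute from paired predicted and actual ratings; a short illustrative sketch:

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error over paired prediction/actual lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def aae(predicted, actual):
    """Absolute Average Error (mean absolute error)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Example: rmse([3.2, 4.1], [3, 5]) ≈ 0.65 and aae([3.2, 4.1], [3, 5]) = 0.55
```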

42 Experiments - Implementation K-NN Implemented in C/C++; classes were converted to structs. It is difficult to manage the massive dataset in memory, and the size of the program makes it hard to run in C++. Comparing against every user needs a lot of fine tuning of the code to achieve reasonable performance, which is K-NN's inherent problem. Ease of implementation vs. speed is an important trade-off: using maps and vectors only adds storage overhead, and whatever speed they gain is negated by it.

43 Experiments - Implementation C-K-NN Implemented using Perl, Matlab, Python, and MySQL. Perl's hashes of hashes came to the rescue; its ease of token and string processing was most helpful, and the complex logic was easy to express in Perl (regexes help). Python interfaces with IMDB (IMDbPY), and MySQL holds a local copy of the IMDB database. Matlab does the clustering (K-Means). Fine tuning of the algorithm and ample available memory negate the slow, interpreted nature of these languages.

44 Experiments - Results Results on the described dataset: a table comparing the methods on Absolute Average Error, Root Mean Square Error, and time in minutes, for K-NN, C-K-NN, the Netflix leaderboard topper (time not applicable), and the current Netflix system (time not applicable).

45 Experiments - Results Charts: comparison of RMSE and Absolute Average Error for K-NN, C-K-NN, Netflix (current topper), and Netflix (current system), and of the time taken in minutes by K-NN and C-K-NN.

46 Experiments - Results Charts: distribution of the absolute error, showing the number of movies at each error level for the standard K-NN method and for the C-K-NN method.

47

48 Related Work Methods already applied to this problem include Matrix Factorization Methods Regularized Singular Value Decomposition [Paterek, 2007][Webb, 2007] Biases with Regularized SVD [Paterek, 2007] Probabilistic Latent Semantic Analysis (pLSA) [Hofmann, 2004] Nearest Neighbor Methods [Bell and Koren, 2007] Alternating Least Squares [Bell and Koren, 2007] Post-processing of SVD features [Paterek, 2007]

49

50 Future Work K-NN method Different values of K could be experimented with Distributed processing of this problem Distance-weighting the contributions from neighbors C-K-NN Trying different numbers of clusters The dates provided with the ratings could be used in clustering along with genre More information from IMDB or other sources might be included Clustering the movies first and then predicting ratings for users is also possible

51

52 Conclusions We presented results of two methods for the Netflix Prize problem, including a novel clustering-based method The first method, a standard K-Nearest Neighbor method, achieves a lower RMSE value but is very slow at prediction, a consequence of comparing against every user who rated the movie The second method clusters the users based on the genres of the movies they rated and creates super users from these clusters

53 Conclusions The standard K-NN method performs slightly better than the clustering-based method on the Root Mean Square Error metric but is extremely slow Our clustering-based method has a higher Root Mean Square Error than the standard K-NN method but is extremely fast and practical for large-scale implementations It also shows promise of being accurate for many predictions

54

55 Atul S Kulkarni kulka053@d.umn.edu
