Extension Study on Item-Based P-Tree Collaborative Filtering Algorithm for Netflix Prize

Similar documents
An Empirical Comparison of Collaborative Filtering Approaches on Netflix Data

Performance Comparison of Algorithms for Movie Rating Estimation

Recommender Systems New Approaches with Netflix Dataset

Property1 Property2. by Elvir Sabic. Recommender Systems Seminar Prof. Dr. Ulf Brefeld TU Darmstadt, WS 2013/14

Comparison of Recommender System Algorithms focusing on the New-Item and User-Bias Problem

Recommender System using Collaborative Filtering Methods: A Performance Evaluation

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

Improved Neighborhood-based Collaborative Filtering

A Scalable, Accurate Hybrid Recommender System

A Random Walk Method for Alleviating the Sparsity Problem in Collaborative Filtering

Progress Report: Collaborative Filtering Using Bregman Co-clustering

Collaborative Filtering based on User Trends

A New Technique of Lossless Image Compression using PPM-Tree

Recommender System Optimization through Collaborative Filtering

Recommender Systems - Content, Collaborative, Hybrid

Robustness and Accuracy Tradeoffs for Recommender Systems Under Attack

System For Product Recommendation In E-Commerce Applications

Towards a hybrid approach to Netflix Challenge

Seminar Collaborative Filtering. KDD Cup. Ziawasch Abedjan, Arvid Heise, Felix Naumann

Content-based Dimensionality Reduction for Recommender Systems

Movie Recommender System - Hybrid Filtering Approach

Comparing State-of-the-Art Collaborative Filtering Systems

amount of available information and the number of visitors to Web sites in recent years

Using Social Networks to Improve Movie Rating Predictions

The Tourism Recommendation of Jingdezhen Based on Unifying User-based and Item-based Collaborative filtering

Additive Regression Applied to a Large-Scale Collaborative Filtering Problem

Domain Independent Prediction with Evolutionary Nearest Neighbors.

A Time-based Recommender System using Implicit Feedback

Rating Prediction Using Preference Relations Based Matrix Factorization

Personalize Movie Recommendation System CS 229 Project Final Writeup

Data Mining Techniques

Hybrid algorithms for recommending new items

BordaRank: A Ranking Aggregation Based Approach to Collaborative Filtering

Solving the Sparsity Problem in Recommender Systems Using Association Retrieval

Proposing a New Metric for Collaborative Filtering

CS224W Project: Recommendation System Models in Product Rating Predictions

Using Data Mining to Determine User-Specific Movie Ratings

arxiv: v2 [cs.lg] 15 Nov 2011

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

Collaborative filtering models for recommendations systems

Collaborative Filtering Applied to Educational Data Mining

Part 11: Collaborative Filtering. Francesco Ricci

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

A PERSONALIZED RECOMMENDER SYSTEM FOR TELECOM PRODUCTS AND SERVICES

A Constrained Spreading Activation Approach to Collaborative Filtering

CptS 570 Machine Learning Project: Netflix Competition. Parisa Rashidi Vikramaditya Jakkula. Team: MLSurvivors. Wednesday, December 12, 2007

Data Mining Techniques

Collaborative Filtering using Weighted BiPartite Graph Projection A Recommendation System for Yelp

Collaborative Filtering using Euclidean Distance in Recommendation Engine

Recommender Systems (RSs)

Learning Bidirectional Similarity for Collaborative Filtering

By Atul S. Kulkarni Graduate Student, University of Minnesota Duluth. Under The Guidance of Dr. Richard Maclin

Semantically Enhanced Collaborative Filtering on the Web

New user profile learning for extremely sparse data sets

Collaborative Filtering: A Comparison of Graph-Based Semi-Supervised Learning Methods and Memory-Based Methods

Project Report. An Introduction to Collaborative Filtering

Collaborative Filtering using a Spreading Activation Approach

NLMF: NonLinear Matrix Factorization Methods for Top-N Recommender Systems

Application of Dimensionality Reduction in Recommender System -- A Case Study

SOCIAL MEDIA MINING. Data Mining Essentials

Survey on Collaborative Filtering Technique in Recommendation System

Recommender Systems. Techniques of AI

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

A PROPOSED HYBRID BOOK RECOMMENDER SYSTEM

Recommender System. What is it? How to build it? Challenges. R package: recommenderlab

CS249: ADVANCED DATA MINING

Two Collaborative Filtering Recommender Systems Based on Sparse Dictionary Coding

Effective Matrix Factorization for Online Rating Prediction

Music Recommendation Engine

Multimedia Data Mining Using P-trees 1,2

Music Recommendation with Implicit Feedback and Side Information

Privacy-Preserving Collaborative Filtering using Randomized Perturbation Techniques

Part 11: Collaborative Filtering. Francesco Ricci

arxiv: v1 [cs.ir] 1 Jul 2016

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Mining Web Data. Lijun Zhang

Data Sparsity Issues in the Collaborative Filtering Framework

A Novel Non-Negative Matrix Factorization Method for Recommender Systems

A Recursive Prediction Algorithm for Collaborative Filtering Recommender Systems

Location-Aware Web Service Recommendation Using Personalized Collaborative Filtering

CAN DISSIMILAR USERS CONTRIBUTE TO ACCURACY AND DIVERSITY OF PERSONALIZED RECOMMENDATION?

Recommendation system Based On Cosine Similarity Algorithm

Mining Web Data. Lijun Zhang

Computational Intelligence Meets the NetFlix Prize

José Miguel Hernández Lobato Zoubin Ghahramani Computational and Biological Learning Laboratory Cambridge University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

COMP 465: Data Mining Recommender Systems

Semantic feedback for hybrid recommendations in Recommendz

Singular Value Decomposition, and Application to Recommender Systems

An Efficient Neighbor Searching Scheme of Distributed Collaborative Filtering on P2P Overlay Network 1

Influence in Ratings-Based Recommender Systems: An Algorithm-Independent Approach

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

Hybrid Recommendation System Using Clustering and Collaborative Filtering

Slope One Predictors for Online Rating-Based Collaborative Filtering

THE goal of a recommender system is to make predictions

Hybrid Weighting Schemes For Collaborative Filtering

Recommendation Algorithms: Collaborative Filtering. CSE 6111 Presentation Advanced Algorithms Fall Presented by: Farzana Yasmeen

Available online at ScienceDirect. Procedia Technology 17 (2014 )

Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback


Extension Study on Item-Based P-Tree Collaborative Filtering Algorithm for Netflix Prize

Tingda Lu, Yan Wang, William Perrizo, Amal Perera, Gregory Wettstein
Computer Science Department, North Dakota State University, Fargo, ND 58108, USA
{tingda.lu, yan.wang, william.perrizo, amal.perera, g.w}@ndsu.edu

Abstract

Recommendation systems provide customers with personalized recommendations by analyzing their purchase history. Item-based Collaborative Filtering (CF) algorithms recommend items that are similar to those a customer purchased before. Item-based algorithms are widely preferred over user-based algorithms because of their lower computational complexity and better accuracy. We implement several item-based CF algorithms on the P-Tree data structure. Similarity corrections and item effects are further included in the algorithms to remove the biases caused by low co-support and by item variance. Our experiments on the Netflix Prize data suggest that support-based similarity correction and item effects have a significant impact on prediction accuracy: the Pearson and SVD item-feature similarity algorithms with support-based similarity correction and item effects achieve the best RMSE scores.

1 INTRODUCTION

With increasing global competition in marketing, e-commerce is shifting from customer acquisition to customer retention. Recommendation systems provide personalized recommendations and hence increase customer loyalty, which eventually leads to business success. By analyzing a customer's purchase history, a Collaborative Filtering algorithm identifies the customer's preferences and recommends the items the customer is most likely to purchase. Both user-based and item-based Collaborative Filtering (CF) algorithms have been successfully deployed in e-commerce [1].

The underlying assumption of user-based Collaborative Filtering algorithms, also known as social filtering methods, is that if two buyers purchased the same items in the past, they have similar preferences, and one buyer is likely to buy products that have already been purchased by the other. However, the computational complexity of user-based CF algorithms is linear in the number of users in the data set, and the volume of transactional data overwhelms the computational resources available to current user-based similarity algorithms.

Item-based filtering algorithms assume that a buyer is likely to buy items similar to those bought before. This assumption holds quite well, especially in the movie rental or purchase context, since most customers have stable preferences for movie genre, actors, directors, etc. Compared to user-based algorithms, item-based CF algorithms raise fewer scalability concerns and give better prediction accuracy [2][3][4].

This paper is an extension of our previous experimental work on the item-based P-Tree Collaborative Filtering algorithm [13]. The rest of the paper is organized as follows. The next section describes the P-Tree structure, which supports quick and efficient data processing. Section 3 introduces the item-based Collaborative Filtering algorithm. In Section 4 we present the experimental results of item-based filtering on the Netflix Prize data set. The final section gives conclusions and directions for future research.

2 P-TREE ALGORITHM

The tremendous volume of data causes a cardinality problem for conventional item-based Collaborative Filtering algorithms. For fast and efficient data processing, we transform the data into P-Trees [12], a lossless, compressed, and data-mining-ready vertical data structure. P-Trees are used for fast computation of counts and for masking specific phenomena. This vertical representation consists of set structures that store the data column by column rather than row by row (as in horizontal relational data). Predicate trees are one choice of vertical data representation and can be used for data mining in place of the more common sets of relational records. The structure has been successfully applied in data mining applications ranging from Classification and Clustering with K-Nearest Neighbor to Classification with Decision Tree Induction and Association Rule Mining [14][15][16][17][18].

A basic P-tree represents one attribute bit, reorganized into a tree structure by recursive sub-division, recording the predicate truth value for each division. Each level of the tree contains truth bits that summarize sub-trees and can be used for phenomenon masking and fast computation of counts. The construction continues recursively down each path until downward closure is reached: a branch terminates when its sub-division is pure (entirely 1 bits or entirely 0 bits), which may or may not be at the leaf level. Basic P-trees and their complements are combined with Boolean operators such as AND (&), OR (|) and NOT (') to produce mask P-trees for individual values, individual tuples, value intervals, tuple rectangles, or any other attribute pattern. The root count of any P-tree gives the occurrence count of its pattern, so P-trees support pattern counting in an efficient, highly scalable manner.
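To make the vertical idea concrete, here is a minimal Python sketch, not the DataSURG P-Tree API [12]: it stores one rating column as per-bit vertical vectors (plain Python integers, without the purity-based compression described above) and answers count queries with bitwise operations. All names here are ours.

    def bit_slices(values, bits=3):
        """Split a column of small non-negative integers into vertical bit vectors."""
        slices = [0] * bits
        for row, v in enumerate(values):
            for b in range(bits):
                if (v >> b) & 1:
                    slices[b] |= 1 << row      # set this row's bit in slice b
        return slices                          # slices[b] holds bit b of every row

    def root_count(mask):
        """Root count of a mask P-tree = number of 1 bits in the bit vector."""
        return bin(mask).count("1")

    ratings = [5, 3, 0, 4, 4, 5]               # one movie's ratings; rows are users
    p = bit_slices(ratings)
    rated = p[0] | p[1] | p[2]                 # mask: users with any nonzero rating
    fives = p[2] & ~p[1] & p[0]                # value mask: rating == 5 (binary 101)
    print(root_count(rated), root_count(fives))   # -> 5 2

An AND of two such "rated" masks yields the co-support count N_{ij} used in Section 3.2 without touching the raw rows.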

3 ITEM-BASED COLLABORATIVE FILTERING ALGORITHM

The item-based P-Tree Collaborative Filtering algorithm is shown in Algorithm 1. On the first run, the raw horizontal data is transformed into the vertical P-Tree structure; the P-Trees are saved as binary files on disk and can be loaded directly afterwards. To predict the rating of user u on item i, we build the item-item similarity matrix and identify the K items most similar to i. The prediction is then made from u's ratings on these neighbor items.

    Input:  UserSet U, ItemSet I, observed ratings T, neighbor size K
    Output: predicted rating r(u,i) for each (u,i) in the query set Q

    if PTree.exists() then
        PTree.load_binary()
    else
        PTree.transform()
        PTree.save_binary()
    foreach i in I do
        foreach j in I do
            sim[i,j] = sim(PTree[i], PTree[j])
    pt = PTree.get_items(u)                    // items rated by user u
    sort pt by sim[i, .] in descending order
    sum = 0.0; weight = 0.0
    for j = 0 to K-1 do
        sum    += r(u, pt[j]) * sim[i, pt[j]]
        weight += sim[i, pt[j]]
    r(u,i) = sum / weight

    Algorithm 1: Item-based P-Tree CF Algorithm

3.1 Item-based Similarity

In this section we present several similarity functions for computing the similarity between items. Each similarity between items i and j is computed over the co-support users, i.e., the users who rated both i and j, rather than the users who rated either of them. The purpose is to cope with the sparsity of the data set, in which most users purchased or rated less than 1% of the items. In the following subsections, U_{ij} denotes the set of co-support users of items i and j.

3.1.1 Cosine Similarity

In Cosine similarity, items are treated as vectors and the similarity between two items is the cosine of the angle between the corresponding vectors:

    sim(i,j) = \frac{\sum_{u \in U_{ij}} r_{u,i} \, r_{u,j}}{\sqrt{\sum_{u \in U_{ij}} r_{u,i}^2} \sqrt{\sum_{u \in U_{ij}} r_{u,j}^2}}

3.1.2 Pearson Correlation

The Pearson correlation, the most popular similarity measure, is

    sim(i,j) = \frac{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_i)^2} \sqrt{\sum_{u \in U_{ij}} (r_{u,j} - \bar{r}_j)^2}}

where \bar{r}_i and \bar{r}_j are the average ratings of items i and j respectively.
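A minimal NumPy sketch of these two measures, with a dense toy matrix standing in for the P-Tree representation (0 marks "not rated"; variable names are ours, and the item means are taken over the co-support users, one reasonable reading of the formula above):

    import numpy as np

    # R[u, i] = rating of user u on item i; 0 means "not rated".
    R = np.array([[5, 3, 0, 4],
                  [4, 0, 4, 5],
                  [1, 1, 5, 2],
                  [0, 2, 4, 0]], dtype=float)

    def cosupport(R, i, j):
        """Users who rated both item i and item j (the set U_ij)."""
        return np.nonzero((R[:, i] > 0) & (R[:, j] > 0))[0]

    def cosine_sim(R, i, j):
        u = cosupport(R, i, j)
        a, b = R[u, i], R[u, j]
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def pearson_sim(R, i, j):
        u = cosupport(R, i, j)
        a = R[u, i] - R[u, i].mean()       # center by the item means over U_ij
        b = R[u, j] - R[u, j].mean()
        d = np.linalg.norm(a) * np.linalg.norm(b)
        return a @ b / d if d > 0 else 0.0

    print(cosine_sim(R, 0, 1), pearson_sim(R, 0, 1))   # -> 0.992..., 1.0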
3.1.3 Adjusted Cosine Similarity

Cosine and Pearson similarity do not take user variance into account, although it can have a significant impact: some users give high ratings freely, while other, more critical customers are reluctant to do so even for items they like. To eliminate this user rating variance, the Adjusted Cosine similarity of items i and j is

    sim(i,j) = \frac{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_u)^2} \sqrt{\sum_{u \in U_{ij}} (r_{u,j} - \bar{r}_u)^2}}

where \bar{r}_u denotes the average rating of user u over all the items the user rated.

3.1.4 SVD Item-Feature Similarity

The regularized Singular Value Decomposition (SVD) algorithm was proposed by Simon Funk. The prediction for user u on item i is

    \hat{r}_{u,i} = U_u^T I_i

where U_u and I_i are M-dimensional feature vectors of user u and item i. We define the SVD item-feature similarity of items i and j as the Pearson similarity of the corresponding item feature vectors V_i and V_j:

    sim(i,j) = \frac{\sum_{m=1}^{M} (V_{i,m} - \bar{V}_i)(V_{j,m} - \bar{V}_j)}{\sqrt{\sum_{m=1}^{M} (V_{i,m} - \bar{V}_i)^2} \sqrt{\sum_{m=1}^{M} (V_{j,m} - \bar{V}_j)^2}}
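Continuing the sketch above: Adjusted Cosine subtracts each user's mean rating before taking the cosine, and the SVD item-feature similarity correlates rows of an item-feature matrix V such as Funk-SVD training would produce (training itself is out of scope here; helper names are ours):

    import numpy as np

    def user_means(R):
        """Per-user average over the items each user actually rated."""
        rated = R > 0
        return np.where(rated.any(axis=1),
                        R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1), 0.0)

    def adjusted_cosine_sim(R, i, j):
        u = np.nonzero((R[:, i] > 0) & (R[:, j] > 0))[0]
        mu = user_means(R)[u]
        a, b = R[u, i] - mu, R[u, j] - mu      # remove each user's rating level
        d = np.linalg.norm(a) * np.linalg.norm(b)
        return a @ b / d if d > 0 else 0.0

    def svd_feature_sim(V, i, j):
        """Pearson correlation of the item-feature rows V[i] and V[j]."""
        a, b = V[i] - V[i].mean(), V[j] - V[j].mean()
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))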

3.2 Similarity Correction

Two items should not be considered similar when only a few customers purchased or rated both of them. We therefore include the co-support of two items in the computation of item similarity, applying the correction to the Cosine, Pearson, Adjusted Cosine and SVD item-feature similarity functions. The corrected similarity is defined as

    sim'(i,j) = \log(N_{ij}) \cdot sim(i,j)

where N_{ij} is the number of users who rated both item i and item j.

3.3 Prediction

Once the item similarities are calculated, we predict the rating of user u on item i from the user's previous ratings on the K items most similar to i, a set denoted I_K.

3.3.1 Weighted Average

The rating of user u on item i is predicted as the weighted average of u's ratings on the items in I_K, with the similarity between i and neighbor item j as the weight:

    \hat{r}_{u,i} = \frac{\sum_{j \in I_K} r_{u,j} \, sim(i,j)}{\sum_{j \in I_K} sim(i,j)}

3.3.2 Item Effects

Weighted-average prediction does not consider item variance. However, analogous to the user variance described in Section 3.1.3, item variance exists and has a significant impact: a popular item can induce high ratings through conformity, the psychological phenomenon in which an individual's behavior is influenced by other people. To remove this item effect from the prediction, we replace r_{u,j} in the weighted average with r_{u,j} - \bar{r}_j + \bar{r}_i:

    \hat{r}_{u,i} = \frac{\sum_{j \in I_K} (r_{u,j} - \bar{r}_j + \bar{r}_i) \, sim(i,j)}{\sum_{j \in I_K} sim(i,j)}

3.3.3 Linear Regression

The raw predictions can be further regressed with the linear model

    \hat{r}'_{u,i} = \alpha \hat{r}_{u,i} + \beta + \epsilon

where the regression parameters \alpha and \beta are determined by solving the least-squares problem

    \min_{\alpha,\beta} \sum_{(u,i) \in Q} (r_{u,i} - \hat{r}'_{u,i})^2
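A compact sketch of Sections 3.2-3.3 in the same toy setting (names are ours; `item_means` averages over each item's raters, and the regression is fit with an ordinary least-squares solve):

    import numpy as np

    def item_means(R):
        """Per-item average over the users who rated each item."""
        rated = R > 0
        return np.where(rated.any(axis=0),
                        R.sum(axis=0) / np.maximum(rated.sum(axis=0), 1), 0.0)

    def corrected_sim(sim, n_ij):
        """Support correction of Sec. 3.2: scale similarity by log(N_ij)."""
        return np.log(n_ij) * sim if n_ij > 1 else 0.0

    def predict(R, sims, u, i, K, imean):
        """Top-K weighted average with the item-effect substitution (Sec. 3.3.2)."""
        rated = np.nonzero(R[u] > 0)[0]                       # items user u rated
        nbrs = sorted(rated, key=lambda j: -sims[i, j])[:K]   # K most similar to i
        num = sum((R[u, j] - imean[j] + imean[i]) * sims[i, j] for j in nbrs)
        den = sum(sims[i, j] for j in nbrs)
        return num / den if den else imean[i]

    def fit_regression(raw, truth):
        """Solve min ||alpha*raw + beta - truth||^2 (Sec. 3.3.3)."""
        A = np.column_stack([raw, np.ones(len(raw))])
        (alpha, beta), *_ = np.linalg.lstsq(A, truth, rcond=None)
        return alpha, beta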

4 EXPERIMENT

4.1 Data Set and Quality Evaluation

The training data set of the Netflix Prize [11] consists of 100,480,507 ratings collected over seven years, given by 480,189 randomly chosen, anonymous Netflix customers on 17,770 movies. A further 2,817,131 ratings are provided as the test data set, half as the quiz set and the other half as the test set. We evaluate the Root Mean Squared Error (RMSE) on the Netflix test data set.

4.2 Experimental Results

4.2.1 Experiment on Neighborhood Size

From our previous work we know that the neighborhood size of a Collaborative Filtering algorithm has a significant impact on prediction quality. Figure 1 shows the RMSE scores of the Cosine, Pearson, Adjusted Cosine and SVD item-feature similarity algorithms with 10, 20, 30, 40 and 50 neighbors on the Netflix data.

[Figure 1: RMSE based on Neighbor Size]

The results indicate that the optimal neighborhood size ranges from 20 to 30 across the similarity methods. Prediction accuracy improves as the neighborhood size grows from 10 toward 30, since more similar movies are included; as it grows further and more non-relevant movies are included, prediction quality deteriorates. The detailed RMSE scores on the Netflix test data are shown in Table 1.

    K     Cosine    Pearson   Adj.Cos   SVD IF
    10    1.0742    1.0092    0.9786    0.9865
    20    1.0629    1.0006    0.9685    0.9900
    30    1.0602    1.0019    0.9666    0.9972
    40    1.0592    1.0043    0.9660    1.0031
    50    1.0589    1.0064    0.9658    1.0078

    Table 1: RMSE by Neighbor Size

4.2.2 Experiment on Similarity Algorithms

Adjusted Cosine consistently achieves better results than the Cosine, Pearson and SVD item-feature similarities; by discarding the user variance it reaches better prediction accuracy. The best RMSE score of each similarity algorithm is shown in Figure 2.

[Figure 2: Minimum RMSE on Similarity Algorithms]

4.2.3 Experiment on Similarity Correction

We apply the similarity correction described in Section 3.2 to the Cosine, Pearson, Adjusted Cosine and SVD item-feature similarity algorithms. Cosine, Pearson and SVD item-feature obtain better RMSE scores with the correction; Adjusted Cosine does not. Detailed RMSE scores are shown in Table 2; Pearson similarity improves by nearly 2.8%.

                         Cosine    Pearson   Adj.Cos   SVD IF
    Before Correction    1.0589    1.0006    0.9658    0.9865
    After Correction     1.0588    0.9726    1.0637    0.9791
    Improvement (%)      0.009     2.798    -10.137    0.750

    Table 2: Similarity Correction

4.2.4 Experiment on Item Effects

We repeat the experiments with item effects added to the Cosine, Pearson, Adjusted Cosine and SVD item-feature similarities. The results show that conformity does exist and that item effects have a significant impact on prediction accuracy. Table 3 shows the detailed RMSE improvements: about 9.6% for Cosine, 5.6% for Pearson, 2.0% for Adjusted Cosine and 4.9% for SVD item-feature.

                         Cosine    Pearson   Adj.Cos   SVD IF
    w/o Item Effects     1.0589    1.0006    0.9658    0.9865
    with Item Effects    0.9575    0.9450    0.9468    0.9381
    Improvement (%)      9.576     5.557     1.967     4.906

    Table 3: Item Effects

Table 4 shows the best RMSE scores of the four similarity algorithms with similarity correction and item effects. Cosine, Pearson and SVD item-feature CF achieve their best results when both are included; for Adjusted Cosine we include only the item effects, since it does not benefit from the support-based similarity correction.

                     Cosine    Pearson   Adj.Cos   SVD IF
    Correction       Include   Include   Exclude   Include
    Item Effects     Include   Include   Include   Include
    Best RMSE        0.9526    0.9306    0.9468    0.9251

    Table 4: Similarity Correction & Item Effects
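In miniature, the neighborhood-size sweep of Section 4.2.1 can be reproduced with the `predict` helper sketched after Section 3.3 (the holdout split and all names here are ours, not the paper's harness):

    import numpy as np

    def rmse(pred, truth):
        pred, truth = np.asarray(pred), np.asarray(truth)
        return float(np.sqrt(np.mean((pred - truth) ** 2)))

    def sweep_k(R, sims, heldout, imean, ks=(10, 20, 30, 40, 50)):
        """heldout: (u, i, true_rating) triples withheld from R."""
        scores = {}
        for K in ks:
            preds = [predict(R, sims, u, i, K, imean) for u, i, _ in heldout]
            scores[K] = rmse(preds, [r for _, _, r in heldout])
        return scores      # pick the K with the smallest RMSE, as in Table 1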

4.2.5 Experiment on Regression

The raw predictions from the Cosine, Pearson, Adjusted Cosine and SVD item-feature similarity functions are regressed with the linear model of Section 3.3.3. Table 5 shows the resulting RMSE scores and the regression parameters \alpha and \beta.

             Cosine    Pearson   Adj.Cos   SVD IF
    alpha    0.9624    0.9337    0.9115    0.9608
    beta     0.0822    0.1819    0.2809    0.0664
    RMSE     0.9505    0.9271    0.9437    0.9212

    Table 5: Regression

A boosting algorithm that combines the four predictors is further implemented, and the final RMSE score reaches 0.9166. Table 6 shows the weight assigned to each similarity function.

    Similarity   Weight
    Cosine      -0.1730
    Pearson      0.3331
    Adj.Cos      0.1224
    SVD IF       0.6801

    Table 6: Boosting Weights
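The paper does not spell out the boosting step. One common way to obtain such per-predictor weights is a least-squares blend of the four prediction columns on a probe set; the sketch below makes that assumption and is not necessarily the authors' procedure:

    import numpy as np

    def blend_weights(P, truth):
        """P: (n, 4) matrix of Cosine, Pearson, Adj.Cos and SVD IF predictions
        on a probe set. Least-squares weights may be negative, as the Cosine
        weight in Table 6 is."""
        w, *_ = np.linalg.lstsq(P, truth, rcond=None)
        return w

    def blend(P, w):
        return P @ w       # final blended prediction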

5 CONCLUSION

In this paper we presented an extension study of item-based similarity Collaborative Filtering algorithms on the Netflix Prize data. The experiments cover the Cosine, Pearson, Adjusted Cosine and SVD item-feature similarity algorithms, each implemented on P-Trees with different neighborhood sizes. The experiments suggest that the optimal neighborhood size ranges from 20 to 30. The results also show that support-based similarity correction and item effects significantly improve the prediction accuracy of the Cosine, Pearson and SVD item-feature algorithms; with both included, the Pearson and SVD item-feature Collaborative Filtering algorithms achieve the most accurate predictions.

6 ACKNOWLEDGMENTS

We would like to thank all DataSURG members for their hard work on the P-Tree API. We also thank the Center for High Performance Computing at North Dakota State University, which provided the computing facilities for our experiments.

References

[1] G. Adomavicius and A. Tuzhilin, "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions," IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 6, pp. 734-749, 2005.
[2] G. Karypis, "Evaluation of Item-Based Top-N Recommendation Algorithms," Proceedings of the 10th International Conference on Information and Knowledge Management, pp. 247-254, 2001.
[3] B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "Item-Based Collaborative Filtering Recommendation Algorithms," Proceedings of the 10th International Conference on World Wide Web, pp. 285-295, 2001.
[4] M. Deshpande and G. Karypis, "Item-Based Top-N Recommendation Algorithms," ACM Transactions on Information Systems, Vol. 22, Issue 1, pp. 143-177, 2004.
[5] R. M. Bell and Y. Koren, "Improved Neighborhood-based Collaborative Filtering," KDD Cup and Workshop 2007, pp. 7-14, 2007.
[6] R. M. Bell, Y. Koren and C. Volinsky, "Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights," Proceedings of the 7th IEEE International Conference on Data Mining, pp. 43-52, 2007.
[7] D. Billsus and M. J. Pazzani, "Learning Collaborative Information Filters," Proceedings of the 15th International Conference on Machine Learning, pp. 46-54, 1998.
[8] D. Goldberg, D. Nichols, B. M. Oki and D. Terry, "Using Collaborative Filtering to Weave an Information Tapestry," Communications of the ACM, Vol. 35, pp. 61-70, 1992.
[9] J. S. Breese, D. Heckerman and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering," Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence, pp. 43-52, 1998.
[10] J. Wang, A. P. de Vries and M. J. T. Reinders, "Unifying User-based and Item-based Collaborative Filtering Approaches by Similarity Fusion," Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 501-508, 2006.
[11] Netflix Prize, http://www.netflixprize.com
[12] DataSURG, P-tree Application Programming Interface Documentation, North Dakota State University. http://midas.cs.ndsu.nodak.edu/~datasurg/ptree/
[13] T. Lu, W. Perrizo and Y. Wang, "Experimental Study on Item-based P-Tree Collaborative Filtering Algorithm for Netflix," SEDE, 2009.
[14] Q. Ding, M. Khan, A. Roy and W. Perrizo, "The P-tree Algebra," Proceedings of the ACM Symposium on Applied Computing, pp. 426-431, 2002.
[15] A. Perera and W. Perrizo, "Parameter Optimized, Vertical, Nearest Neighbor Vote and Boundary Based Classification," CATA, 2007.
[16] A. Perera, T. Abidin, G. Hamer and W. Perrizo, "Vertical Set Square Distance Based Clustering without Prior Knowledge of K," 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE 2005), Toronto, Canada, 2005.
[17] I. Rahal and W. Perrizo, "An Optimized Approach for KNN Text Categorization using P-Trees," Proceedings of the ACM Symposium on Applied Computing, pp. 613-617, 2004.
[18] A. Perera, A. Denton, P. Kotala, W. Jockheck, W. Valdivia and W. Perrizo, "P-tree Classification of Yeast Gene Deletion Data," SIGKDD Explorations, Vol. 4, Issue 2, 2002.