Collaborative Filtering based on User Trends

Similar documents
Content-based Dimensionality Reduction for Recommender Systems

Collaborative recommender systems: Combining effectiveness and efficiency

Feature-weighted User Model for Recommender Systems

Nearest-Biclusters Collaborative Filtering

Justified Recommendations based on Content and Rating Data

New user profile learning for extremely sparse data sets

Application of Dimensionality Reduction in Recommender System -- A Case Study

The Design and Implementation of an Intelligent Online Recommender System

Comparison of Recommender System Algorithms focusing on the New-Item and User-Bias Problem

Reproducing and Prototyping Recommender Systems in R

Movie Recommender System - Hybrid Filtering Approach

Recommender Systems: User Experience and System Issues

Collaborative Filtering Based on Iterative Principal Component Analysis. Dohyun Kim and Bong-Jin Yum*

Robustness and Accuracy Tradeoffs for Recommender Systems Under Attack

A Time-based Recommender System using Implicit Feedback

Recommender Systems (RSs)

BordaRank: A Ranking Aggregation Based Approach to Collaborative Filtering

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

Comparing State-of-the-Art Collaborative Filtering Systems

Hotel Recommendation Based on Hybrid Model

A Scalable, Accurate Hybrid Recommender System

GENETIC ALGORITHM BASED COLLABORATIVE FILTERING MODEL FOR PERSONALIZED RECOMMENDER SYSTEM

Semantically Enhanced Collaborative Filtering on the Web

Real-time Collaborative Filtering Recommender Systems

A Content Vector Model for Text Classification

Two Collaborative Filtering Recommender Systems Based on Sparse Dictionary Coding

Mining Web Data. Lijun Zhang

Project Report. An Introduction to Collaborative Filtering

A Recommender system is a program that utilizes algorithms

A Recursive Prediction Algorithm for Collaborative Filtering Recommender Systems

An Empirical Comparison of Collaborative Filtering Approaches on Netflix Data

Extension Study on Item-Based P-Tree Collaborative Filtering Algorithm for Netflix Prize

Collaborative Filtering: A Comparison of Graph-Based Semi-Supervised Learning Methods and Memory-Based Methods

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

Data Sparsity Issues in the Collaborative Filtering Framework

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Performance of Recommender Algorithms on Top-N Recommendation Tasks

arxiv: v1 [cs.ir] 1 Jul 2016

Applying a linguistic operator for aggregating movie preferences

The Tourism Recommendation of Jingdezhen Based on Unifying User-based and Item-based Collaborative filtering

Property1 Property2. by Elvir Sabic. Recommender Systems Seminar Prof. Dr. Ulf Brefeld TU Darmstadt, WS 2013/14

Interest-based Recommendation in Digital Library

Proposing a New Metric for Collaborative Filtering

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

Part 11: Collaborative Filtering. Francesco Ricci

Achieving Better Predictions with Collaborative Neighborhood

Collaborative Filtering based on Dynamic Community Detection

An Efficient Neighbor Searching Scheme of Distributed Collaborative Filtering on P2P Overlay Network 1

An Efficient and Effective Algorithm for Density Biased Sampling

Recommender System using Collaborative Filtering Methods: A Performance Evaluation

Visualizing Recommendation Flow on Social Network

Slope One Predictors for Online Rating-Based Collaborative Filtering

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Solving the Sparsity Problem in Recommender Systems Using Association Retrieval

The Effect of Diversity Implementation on Precision in Multicriteria Collaborative Filtering

Influence in Ratings-Based Recommender Systems: An Algorithm-Independent Approach

Hybrid Weighting Schemes For Collaborative Filtering

Review on Techniques of Collaborative Tagging

Grey forecast model for accurate recommendation in presence of data sparsity and correlation

A Constrained Spreading Activation Approach to Collaborative Filtering

Learning Bidirectional Similarity for Collaborative Filtering

Time Series Prediction as a Problem of Missing Values: Application to ESTSP2007 and NN3 Competition Benchmarks

SOM+EOF for Finding Missing Values

Seminar Collaborative Filtering. KDD Cup. Ziawasch Abedjan, Arvid Heise, Felix Naumann

José Miguel Hernández Lobato Zoubin Ghahramani Computational and Biological Learning Laboratory Cambridge University

Hybrid algorithms for recommending new items

Performance Comparison of Algorithms for Movie Rating Estimation

highest cosine coefficient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

Recommender Systems New Approaches with Netflix Dataset

Recommender Systems: User Experience and System Issues. About me. Scope of Recommenders. A Quick Introduction. Wide Range of Algorithms

Privacy-Preserving Collaborative Filtering using Randomized Perturbation Techniques

Collaborative Filtering via Euclidean Embedding

Adaptive k-nearest-neighbor Classification Using a Dynamic Number of Nearest Neighbors

Improving the Accuracy of Top-N Recommendation using a Preference Model

NLMF: NonLinear Matrix Factorization Methods for Top-N Recommender Systems

SHILLING ATTACK DETECTION IN RECOMMENDER SYSTEMS USING CLASSIFICATION TECHNIQUES

An Approximate Singular Value Decomposition of Large Matrices in Julia

Experiences from Implementing Collaborative Filtering in a Web 2.0 Application

USING OF THE K NEAREST NEIGHBOURS ALGORITHM (k-nns) IN THE DATA CLASSIFICATION

A PERSONALIZED RECOMMENDER SYSTEM FOR TELECOM PRODUCTS AND SERVICES

Towards QoS Prediction for Web Services based on Adjusted Euclidean Distances

Information Retrieval: Retrieval Models

A probabilistic model to resolve diversity-accuracy challenge of recommendation systems

Using Dimensionality Reduction to Improve Similarity. fefrom, May 28, 2001

Data Obfuscation for Privacy-Enhanced Collaborative Filtering

General Instructions. Questions

Using Data Mining to Determine User-Specific Movie Ratings

EigenRank: A Ranking-Oriented Approach to Collaborative Filtering

Available online at ScienceDirect. Procedia Technology 17 (2014)

Explaining Recommendations: Satisfaction vs. Promotion

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

Using Social Networks to Improve Movie Rating Predictions

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Neighborhood-Based Collaborative Filtering

An Empirical Study of Lazy Multilabel Classification Algorithms

Alleviating the Sparsity Problem in Collaborative Filtering by using an Adapted Distance and a Graph-based Method

Towards Time-Aware Semantic enriched Recommender Systems for movies

Transcription:

Collaborative Filtering based on User Trends

Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos Papadopoulos, and Yannis Manolopoulos
Aristotle University, Department of Informatics, Thessaloniki 54124, Greece
{symeon, alex, apostol, manolopo}@delab.csd.auth.gr

Abstract. Recommender systems base their operation on past user ratings over a collection of items, for instance, books, CDs, etc. Collaborative Filtering (CF) is a successful recommendation technique. User ratings are not expected to be independent, as users follow trends of similar rating behavior. In terms of Text Mining, this is analogous to the formation of higher-level concepts from plain terms. In this paper, we propose a novel CF algorithm which uses Latent Semantic Indexing (LSI) to detect rating trends and performs recommendations according to them. Our results indicate its superiority over existing CF algorithms.

1 Introduction

The information overload problem affects our everyday experience while searching for valuable knowledge. To overcome this problem, we often rely on suggestions from others who have more experience on a topic. On the Web, this becomes more manageable with Collaborative Filtering (CF), which provides recommendations based on the suggestions of users who have similar preferences. Latent Semantic Indexing (LSI) has been used extensively in information retrieval to detect latent semantic relationships between terms and documents; in this way, higher-level concepts are generated from plain terms. In CF, this is analogous to the formation of user trends from individual preferences. In this paper, we propose a new algorithm that is based on LSI to produce a condensed model for the user-item matrix. This model comprises a matrix that captures the main user trends, removing noise and reducing its size. Our contribution and novelty are summarized as follows: (i) based on Information Retrieval, we introduce the pseudo-user concept in CF; (ii) we implement a novel algorithm which tunes the number of principal components according to the data characteristics; (iii) we generalize the recommendation procedure for both user- and item-based CF methods; (iv) we propose a new top-N list generation algorithm based on SVD and the Highest Predicted Rated items.

The rest of this paper is organized as follows. Section 2 summarizes the related work, whereas Section 3 contains the analysis of the CF factors. The proposed approach is described in Section 4. Experimental results are given in Section 5. Finally, Section 6 concludes this paper.

2 Related work

In 1992, the Tapestry system (Goldberg et al. (1992)) introduced Collaborative Filtering (CF). In 1994, the GroupLens system (Resnick et al. (1994)) implemented a CF algorithm based on common user preferences; nowadays it is known as the user-based CF algorithm. In 2001, another CF algorithm was proposed, which builds a neighborhood of nearest items from the similarities between items (Sarwar et al. (2001)); it is denoted as the item-based CF algorithm. Furnas et al. (1988) proposed Latent Semantic Indexing (LSI) in the Information Retrieval area to detect latent semantic relationships between terms and documents. Berry et al. (1994) carried out a survey of the computational requirements for managing LSI-encoded databases (e.g., folding-in, a simple technique that uses an existing SVD to represent new information), and claimed that the reduced-dimensions model is less noisy than the original data. Sarwar et al. (2000) applied dimensionality reduction only to the user-based CF approach. In contrast to our work, Sarwar et al. included the test users in the calculation of the model, as if they were known a priori. For this reason, we introduce the notion of a pseudo-user in order to insert a new user into the model (folding-in), from which recommendations are derived.

3 Factors affecting CF

In this section, we identify the major factors that critically affect all CF algorithms. Table 1 summarizes the symbols that are used in the sequel.

Table 1. Symbols and definitions.

  k         number of nearest neighbors
  N         size of recommendation list
  I         domain of all items
  U         domain of all users
  u, v      some users
  i, j      some items
  I_u       set of items rated by user u
  U_i       set of users that rated item i
  r_{u,i}   the rating of user u on item i
  r̄_u       mean rating value for user u
  r̄_i       mean rating value for item i
  p_{u,i}   predicted rating for user u on item i
  c         number of singular values
  A         original matrix
  A*        approximation matrix of A
  S         singular values of A
  U         left singular vectors of A
  V         right singular vectors of A
  u         user vector
  u_new     inserted new user vector
  n         number of train users
  m         number of items

Similarity measures: A basic factor for the formation of the user/item neighborhood is the similarity measure. The most extensively used similarity measures are based on correlation and cosine similarity (Sarwar et al. (2001)). Specifically, user-based CF algorithms mainly use Pearson's correlation (Equation 1), whereas for item-based CF algorithms the Adjusted Cosine Measure is preferred (Equation 2) (McLaughlin and Herlocker (2004)):

  sim(u, v) = \frac{\sum_{i \in S} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in S} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in S} (r_{v,i} - \bar{r}_v)^2}}, \qquad S = I_u \cap I_v.    (1)

  sim(i, j) = \frac{\sum_{u \in T} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{u \in U_i} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{u \in U_j} (r_{u,j} - \bar{r}_u)^2}}, \qquad T = U_i \cap U_j.    (2)

Generation of recommendation list: The technique used most often for the generation of the top-N list counts the frequency of each item inside the found neighborhood and recommends the N most frequent ones. Henceforth, this technique is denoted as Most-Frequent item recommendation (MF). MF can be applied to both user-based and item-based CF algorithms.

Evaluation metrics: Mean Absolute Error (MAE) has been used in most related works, although it has received criticism as well (McLaughlin and Herlocker (2004)). Other extensively used metrics are precision (ratio of relevant and retrieved to retrieved) and recall (ratio of relevant and retrieved to relevant). We also consider F1, another popular metric for CF algorithms, which combines the previous two.

4 Proposed Method

Our approach applies Latent Semantic Indexing in the CF process. To ease the discussion, we will use the running example illustrated in Figure 1, where I1-I4 are items and U1-U4 are users. As shown, the example data set is divided into a train set and a test set. The null cells (no rating) are presented as zeros.

        I1  I2  I3  I4                I1  I2  I3  I4
  U1     4   1   1   4          U4     1   4   1   0
  U2     1   4   2   0
  U3     2   1   4   5

Fig. 1. Train Set (n x m) and Test Set.
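For illustration, the two baseline similarity measures of Equations 1 and 2 can be coded directly against the small train matrix of Figure 1. The sketch below is not the authors' implementation; the use of Python/numpy, the dense-array representation with 0 meaning "no rating", and the function names are assumptions made for this example.

import numpy as np

A = np.array([[4, 1, 1, 4],
              [1, 4, 2, 0],
              [2, 1, 4, 5]], dtype=float)   # train set of Fig. 1

def pearson_user_sim(A, u, v):
    # Eq. 1: correlation over S = I_u intersect I_v, the items rated by both users
    co = (A[u] > 0) & (A[v] > 0)
    if not co.any():
        return 0.0
    ru_mean = A[u][A[u] > 0].mean()
    rv_mean = A[v][A[v] > 0].mean()
    du, dv = A[u, co] - ru_mean, A[v, co] - rv_mean
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return float(du @ dv / denom) if denom else 0.0

def adjusted_cosine_item_sim(A, i, j):
    # Eq. 2: user-mean-centred similarity between the columns of items i and j
    user_means = np.array([row[row > 0].mean() if (row > 0).any() else 0.0 for row in A])
    co = (A[:, i] > 0) & (A[:, j] > 0)             # T = U_i intersect U_j
    num = ((A[co, i] - user_means[co]) * (A[co, j] - user_means[co])).sum()
    rated_i, rated_j = A[:, i] > 0, A[:, j] > 0    # U_i and U_j in the denominators
    denom = (np.sqrt(((A[rated_i, i] - user_means[rated_i]) ** 2).sum())
             * np.sqrt(((A[rated_j, j] - user_means[rated_j]) ** 2).sum()))
    return float(num / denom) if denom else 0.0

print(pearson_user_sim(A, 0, 2), adjusted_cosine_item_sim(A, 0, 3))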

Applying SVD on train data: Initially, we apply SVD on the n x m train matrix A, which produces three matrices whose product reconstructs the initial matrix, as Equation 3 and Figure 2 show:

  A_{n \times m} = U_{n \times n} \cdot S_{n \times m} \cdot V^{T}_{m \times m}    (3)

Fig. 2. Example of: A_{n x m} (initial matrix A), U_{n x n} (left singular vectors of A), S_{n x m} (singular values of A), V_{m x m} (right singular vectors of A).

Preserving the Principal Components: It is possible to reduce the n x m matrix S so that only the c largest singular values are kept. The reconstructed matrix is then the closest rank-c approximation of the initial matrix A, as shown in Equation 4 and Figure 3:

  A^{*}_{n \times m} = U_{n \times c} \cdot S_{c \times c} \cdot V^{T}_{c \times m}    (4)

Fig. 3. Example of: A*_{n x m} (approximation matrix of A), U_{n x c} (left singular vectors of A*), S_{c x c} (singular values of A*), V_{c x m} (right singular vectors of A*).

We tune the number c of principal components (i.e., dimensions) with the objective of revealing the major trends. The tuning of c is determined by the information percentage that is preserved compared to the original matrix. Therefore, a c-dimensional space is created, and each of the c dimensions corresponds to a distinctive rating trend. Note that in the running example we create a 2-dimensional space which preserves 83% of the total information of the matrix (12.88/15.39).

Inserting a test user in the c-dimensional space: It is evident that, for the user-based approach, the test data should be considered unknown in the c-dimensional space; thus, a specialized insertion process is needed. Given the current ratings of the test user u, we enter a pseudo-user vector into the c-dimensional space using Equation 5 (Furnas et al. (1988)). In the running example, we insert U4 into the 2-dimensional space, as shown in Figure 4:

  u_{new} = u \cdot V_{m \times c} \cdot S^{-1}_{c \times c}    (5)

Fig. 4. Example of: u_new (inserted new user vector), u (user vector), V_{m x c} (the two retained columns of V), S^{-1}_{c x c} (inverse of S restricted to the two retained singular values).
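The SVD model of Equations 3-5 can be reproduced with a few lines of numpy. The sketch below is only illustrative: the choice of numpy, the variable names, and the way c is picked so that a target information percentage is preserved (measured, as in the 12.88/15.39 example, by the share of the retained singular values) are assumptions, not the authors' code.

import numpy as np

A = np.array([[4, 1, 1, 4],
              [1, 4, 2, 0],
              [2, 1, 4, 5]], dtype=float)         # train set of Fig. 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # Eq. 3: A = U * S * V^T

# Tune c: keep the smallest number of singular values whose cumulative share
# of the total reaches the desired information percentage (80% here).
target = 0.80
c = int(np.searchsorted(np.cumsum(s) / s.sum(), target) + 1)   # c = 2 in this example

U_c, S_c, Vt_c = U[:, :c], np.diag(s[:c]), Vt[:c, :]
A_star = U_c @ S_c @ Vt_c                          # Eq. 4: closest rank-c approximation

# Fold the test user U4 of Fig. 1 into the c-dimensional space (Eq. 5):
u = np.array([1, 4, 1, 0], dtype=float)
u_new = u @ Vt_c.T @ np.linalg.inv(S_c)            # pseudo-user coordinates

print(c, np.round(A_star, 2), u_new)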

Generating the neighborhood of users/items: Having a reduced-dimensional representation of the original space, we form the neighborhoods of users/items in that space. For the user-based approach, we find the k nearest neighbors of the pseudo-user vector in the c-dimensional space. The similarities between train and test users can be based on cosine similarity: we first compute the matrix U_{n x c} · S_{c x c} and then perform vector similarity on it. This n x c matrix is the c-dimensional representation of the n users. For the item-based approach, we find the k nearest neighbors of an item vector in the c-dimensional space: we first compute the matrix S_{c x c} · V_{c x m} and then perform vector similarity. This c x m matrix is the c-dimensional representation of the m items.

Generating and evaluating the recommendation list: As mentioned in Section 3, existing ranking criteria, such as MF, are used for the generation of the top-N list in classic CF algorithms. We propose a ranking criterion that uses the predicted rating values of a user for each item. Predicted values are computed by Equations 6 and 7 for the user-based and item-based cases, respectively. These equations have been used in related work for the purpose of MAE calculation, whereas we use them for the generation of the top-N list:

  p_{u,i} = \bar{r}_u + \frac{\sum_{v \in U} sim(u,v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in U} sim(u,v)}    (6)

  p_{u,i} = \bar{r}_i + \frac{\sum_{j \in I} sim(i,j)\,(r_{u,j} - \bar{r}_j)}{\sum_{j \in I} sim(i,j)}    (7)

Therefore, we sort the items in descending order of predicted rating value and recommend the first N of them. This ranking criterion, denoted as Highest Predicted Rated item recommendation (HPR), is motivated by the good prediction accuracy that existing related work reports through the MAE. HPR opts for recommending the items that are most likely to receive a high rating.
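Putting the pieces together for the user-based case, the neighborhood formation in the c-dimensional space and the HPR ranking of Equation 6 could look roughly as follows. This is a minimal sketch under several assumptions (cosine similarity over U_c · S_c, k = 2, N = 1, c = 2 to match the running example); it is not the paper's exact procedure.

import numpy as np

A = np.array([[4, 1, 1, 4],
              [1, 4, 2, 0],
              [2, 1, 4, 5]], dtype=float)            # train set of Fig. 1
u = np.array([1, 4, 1, 0], dtype=float)              # test user U4

c = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_c, S_c, Vt_c = U[:, :c], np.diag(s[:c]), Vt[:c, :]
u_new = u @ Vt_c.T @ np.linalg.inv(S_c)              # folded-in pseudo-user (Eq. 5)

reps = U_c @ S_c                                     # n x c representation of the n train users
cosine = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
sims = np.array([cosine(u_new, reps[v]) for v in range(reps.shape[0])])
neighbors = np.argsort(sims)[::-1][:2]               # k = 2 nearest train users

r_u = u[u > 0].mean()                                # mean rating of the test user
user_means = np.array([row[row > 0].mean() for row in A])

scores = {}
for i in np.where(u == 0)[0]:                        # rank only the items the test user has not rated
    num = sum(sims[v] * (A[v, i] - user_means[v]) for v in neighbors)
    den = sum(sims[v] for v in neighbors)
    scores[i] = r_u + num / den if den else r_u      # prediction p_{u,i} of Eq. 6

top_N = sorted(scores, key=scores.get, reverse=True)[:1]   # HPR: highest predicted items first (N = 1)
print(top_N, scores)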

5 Experimental configuration

In the sequel, we study the performance of the described SVD dimensionality reduction techniques against existing CF algorithms. Both user-based and item-based algorithms are tested. The factors that are treated as parameters are the following: the neighborhood size (k, default value 10), the size of the recommendation list (N, default value 20), and the size of the train set (default value 75%). The metrics we use are recall, precision, and F1. We perform experiments on a real data set that has been used as a benchmark in prior work. In particular, we examined the MovieLens data set, with 100,000 ratings assigned by 943 users to 1,682 movies. Ratings range over the numerical scale from 1 (bad) to 5 (excellent). Finally, the value of an unrated item is considered equal to zero.

5.1 Results for the user-based CF algorithm

First, we compare the existing user-based CF algorithm that uses Pearson similarity against two representative SVD reductions (SVD50 and SVD10); these denote the percentage of the initial information of the user-item matrix that is kept after applying SVD. The results for precision and MAE vs. k are displayed in Figures 5a and 5b, respectively.

Fig. 5. Performance of user-based CF vs. k: (a) precision, (b) MAE (curves: Pearson, SVD50, SVD10).

As shown, the existing Pearson measure performs worse than the SVD reductions. The reason is that the MovieLens data set is sparse and relatively large (high n value). The SVD reductions reveal the most essential dimensions and filter out outliers and misleading information. We now examine the MAE metric; results are illustrated in Figure 5b. Pearson yields the lowest MAE values. This fact indicates that MAE is suitable only for the evaluation of prediction and not of recommendation, as the Pearson measure did not attain the best performance in terms of precision. Finally, we test the described criteria for the HPR top-N list generation algorithm. The results for precision and recall are given in Figure 6. As shown, the combination of the SVD similarity measure with HPR as the list generation algorithm clearly outperforms Pearson with HPR. This is because in the former case the remaining dimensions are the determinative ones and outlier users have been rejected. Note that with SVD50 we preserve only 157 basic dimensions, instead of the 708 train users used by the latter.

Fig. 6. Comparison of the HPR criterion for the generation of the top-N list for user-based CF vs. k: (a) precision, (b) recall (curves: Pearson + HPR, SVD50 + HPR; k = 10, ..., 100).
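The precision, recall, and F1 used throughout Section 5 are standard top-N measures and can be sketched in a few lines. In the snippet below, the way relevance is decided (which test-set items count as "relevant") is deliberately left to the caller, since the paper does not restate its relevance criterion here; the function name is an assumption.

def precision_recall_f1(recommended, relevant):
    # recommended: the top-N item ids returned by the recommender
    # relevant:    the item ids that count as hits for this test user
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: a top-3 list against two relevant items gives precision 1/3, recall 1/2.
print(precision_recall_f1(recommended=[3, 7, 11], relevant=[3, 5]))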

5.2 Results for item-based CF algorithms

We perform similar measurements for the case of item-based CF. Thus, we examine precision and recall for the existing Adjusted Cosine Measure (which considers co-rated items) against SVD50 and SVD10 in the item-based case. The results are depicted in Figure 7 and are analogous to those of the user-based case. The MAE and HPR results are also analogous to the user-based case.

Fig. 7. Performance of item-based CF vs. k: (a) precision, (b) recall (curves: AdjustedCos, SVD50, SVD10).

5.3 Comparative results

We compared user-based and item-based SVD50. For the generation of the top-N list, we use MF for both of them. The results for F1 are displayed in Figure 8a, which demonstrates the superiority of user-based SVD50 over item-based SVD50. Regarding execution time, we measured the wall-clock time of the on-line parts for all test users. The results vs. k are presented in Figure 8b. Item-based CF needs less time to provide recommendations than user-based CF. The reason is that, in the user-based approach, a user rating vector has to be inserted into the c-dimensional space. The generation of the top-N list for the user-based approach further burdens the CF process.

Fig. 8. Comparison between item-based and user-based CF: (a) F1 vs. k (IB-SVD50, UB-SVD50), (b) execution time in seconds vs. k (IB-SVD50, IB-AdjustedCos, UB-SVD50, UB-Pearson).

6 Conclusions

We performed an experimental comparison of the proposed method against well-known CF algorithms, like the user-based and item-based methods (which do not consider trends), on real data sets. Our method shows significant improvements over existing CF algorithms in terms of accuracy (measured through recall/precision). In terms of execution time, the use of smaller matrices reduces run times dramatically. In future work we will consider an approach that combines the user-based and item-based methods, attaining highly accurate recommendations in the minimum response time.

References

Berry, M., Dumais, S. and O'Brien, G. (1994): Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595.
Furnas, G., Deerwester, S., Dumais, S. et al. (1988): Information retrieval using a singular value decomposition model of latent semantic structure. In Proc. ACM SIGIR Conf., pages 465-480.
Goldberg, D., Nichols, D., Oki, B.M. and Terry, D. (1992): Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61-70.
McLaughlin, M.R. and Herlocker, J. (2004): A collaborative filtering algorithm and evaluation metric that accurately model the user experience. In Proc. ACM SIGIR Conf., pages 329-336.
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P. and Riedl, J. (1994): GroupLens: an open architecture for collaborative filtering of netnews. In Proc. Conf. on Computer Supported Cooperative Work, pages 175-186.
Sarwar, B., Karypis, G., Konstan, J. and Riedl, J. (2000): Application of dimensionality reduction in recommender system - a case study. In ACM WebKDD Workshop.
Sarwar, B., Karypis, G., Konstan, J. and Riedl, J. (2001): Item-based collaborative filtering recommendation algorithms. In Proc. WWW Conf., pages 285-295.