Introduction: Background, Recommender Systems and Collaborative Filtering
- Johnathan Parsons
Abstract

Recommender systems are used extensively today in many areas to help users and consumers make decisions. Amazon recommends books based on what you have previously viewed and purchased, Netflix presents shows and movies you might enjoy based on your interactions with the platform, and Facebook serves personalized ads to every user based on gathered browsing information. These systems are based on shared similarities, and there are several ways to develop and model them. This study compares two methods, user-based and item-based filtering, in k-nearest-neighbours systems. The methods are compared on how much they deviate from the true answer when predicting user ratings of movies based on sparse data. The study showed that neither of the methods could be considered objectively better than the other and that the choice of system should be based on the data set.
Chapter 1 Introduction

1.1 Background

In everyday life, it is often necessary to make choices without sufficient personal experience of the alternatives. We then rely on recommendations from other people to make as smart choices as possible. For example, when shopping at a shoe store, a customer could describe features of previously owned shoes to a clerk, and the clerk would then recommend new shoes based on the customer's past experiences. A dedicated clerk could, besides providing recommendations, also remember the past choices and experiences of customers. This would allow the clerk to make personalised recommendations to returning customers. The way we transform this experience to the digital era is by using recommender systems [1].

Recommender systems

Recommender systems can be viewed as a digital representation of the clerk in the previous example. The goal of a recommender system is to predict which items users might be interested in by analysing gathered data. Data can be gathered with an implicit and/or an explicit approach. An implicit approach records users' behaviour when reacting to incoming data (e.g. by recording how long a user actually watched a movie before switching to something else). This can be done without the user's knowledge. The explicit approach depends on the user explicitly specifying their preferences regarding items, e.g. by rating a movie. The input to a recommender system is the gathered data, and the output is a prediction or recommendation for the user [2]. A recommender system's predictions will generally be more accurate the more data it can base them on. Having only a small amount of data to base predictions on is known as the sparse data problem and is expanded upon in the Sparse data problem section below.

Collaborative filtering

Collaborative Filtering (CF) is a common algorithm used in recommender systems. CF provides predictions and recommendations based on other users and/or items in the system.
We assume that similar users or items in the system can be used to predict each other's ratings. If we know that Haris likes the same things as Alex, and Alex also likes candy, then we can predict that Haris will most likely also enjoy candy [3, 4].
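As a minimal illustration of this idea, the sketch below predicts a missing rating as a similarity-weighted average of other users' ratings. All names, items and ratings are made up for the example; they are not part of the study's data.

```python
# Toy user-based prediction: estimate Haris's rating of "candy"
# from users who rate things similarly. All values are made up.
ratings = {
    "Alex":  {"chips": 5, "soda": 4, "candy": 5},
    "Haris": {"chips": 5, "soda": 4},          # no rating for "candy" yet
    "Maya":  {"chips": 1, "soda": 2, "candy": 1},
}

def similarity(a, b):
    """Inverse Manhattan distance over co-rated items (1 = identical)."""
    common = set(a) & set(b)
    dist = sum(abs(a[i] - b[i]) for i in common)
    return 1 / (1 + dist)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = similarity(ratings[user], r)
            num += s * r[item]
            den += s
    return num / den

print(round(predict("Haris", "candy"), 2))   # → 4.5
```

Because Alex agrees with Haris on every co-rated item, Alex's high rating of candy dominates the prediction, which is exactly the intuition behind collaborative filtering.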
Two common methods for implementing collaborative filtering are user-based and item-based filtering. Both of these methods create a similarity matrix, where the similarities between users (or items) are calculated and stored. The distance (similarity) between users can be calculated in several ways; two common methods are the Pearson correlation coefficient and the cosine similarity.

Calculating the similarity between users

To calculate how similar users are, a matrix is used where the users are rows and the different items are columns. One can then look at how similar users are by comparing their ratings for every item. Below is an example matrix and table with 3 users (Amy, Bill and Jim) and only 2 items (Snow Crash and Girl with the Dragon Tattoo).

Figure 1.1: Comparison matrix [guidetodatamining.com]

Figure 1.2: Comparison table [guidetodatamining.com]

Figures 1.1 and 1.2 show Bill and Jim having more in common than any other pair. There are several ways to give a value to this similarity. Some common approaches are:
Manhattan distance

The Manhattan distance is a simple form of similarity calculation. It is the sum of the differences between ratings along every axis. In the above case, where the matrix is in 2D, the Manhattan distance between Bill, at index 1, and Jim, at index 2, would be:

|x_1 - x_2| + |y_1 - y_2| = 2

Euclidean distance

The Euclidean distance uses the difference along every axis and applies the Pythagorean theorem (a^2 + b^2 = c^2) to calculate the "straight line distance" between two objects in the matrix. The Euclidean distance between Jim, at index 1, and Amy, at index 3, is calculated with the equation:

sqrt((x_1 - x_3)^2 + (y_1 - y_3)^2)

Correlation

An issue that isn't visualized by this example is what happens when there is incomplete data, i.e. some users haven't rated some items of the matrix. If users A and B have rated the same 100 items but A and C only have 10 rated items in common, the similarity calculation between A and B should obviously be stronger, as it is based on more data. Using the Manhattan or Euclidean distance, however, this will not be accounted for, making these methods poor when data is missing [5]. To account for this, two other methods, the Pearson correlation coefficient and the cosine similarity, can be used.

Pearson correlation coefficient (PCC)

The PCC fits a line to two users' ratings to get a correlation value, where a straight, increasing line represents a high correlation while a decreasing line shows that the compared units do not correlate much.

Figure 1.3: Example of a correlation table [guidetodatamining.com]

Figures 1.3 and 1.4 show an example of positive correlation. The Pearson correlation coefficient takes what is known as "grade inflation" into account [5]. This is the phenomenon of users rating things differently even though they feel the same way about them. In the above example, Weird Al is the band Clara dislikes the most, yet it is still rated at 4.
Robert also dislikes Weird Al but gives them a rating of 1. In the Manhattan or Euclidean calculations, this would represent a big difference between the users, but

Figure 1.4: Graphing the table shows a positive correlation [guidetodatamining.com]

the graph shows that they are very much alike. When placing these 5 bands in order of preference, they agree completely. The formula for calculating the PCC is:

r = sum_{i=1}^{n} (x_i - x̄)(y_i - ȳ) / ( sqrt(sum_{i=1}^{n} (x_i - x̄)^2) * sqrt(sum_{i=1}^{n} (y_i - ȳ)^2) )   (1.1)

Cosine similarity

Cosine similarity is another way of calculating the similarity between users' preferences. Here the users and their ratings of items are represented as two vectors, and their similarity is based on the cosine of the angle between them. Cosine similarity is often used for recommender systems since it ignores items which neither user has rated, so-called 0-0 matches, which are in abundance when dealing with sparse data. The cosine similarity is calculated as:

cos(x, y) = (x · y) / (|x| |y|)   (1.2)

where the dot in the numerator represents the dot product and |x| in the denominator indicates the length of vector x.

k Nearest Neighbours (kNN)

k nearest neighbours is the method of looking at some number (k) of users or items that are similar when making predictions, meaning that not all users, or items, are accounted for when making a prediction. The difference between user- and item-based filtering lies in whether a matrix of similar users or of similar items is created. Similar users are users who often share sentiment/ratings of items. When recommender systems were first developed, user-based filtering was used, but it has issues with scalability: as the amount of data increases, the cost of calculating the similarity matrix grows rapidly. To combat this, Amazon developed item-based filtering, which labels similar items into groups so that once a user rates some
item highly, the algorithm recommends other similar items from the same group. Item-based filtering scales better than the user-based approach [3, 5, 6].

Evaluation

Two common methods for evaluating recommender systems are used in this study. The Root Mean Squared Error (RMSE) is calculated by:

RMSE = sqrt( (1/n) * sum_{i=1}^{n} d_i^2 )   (1.3)

and the Mean Absolute Error (MAE) is calculated by:

MAE = (1/n) * sum_{i=1}^{n} |d_i|   (1.4)

where n is the number of predictions made and d_i is the distance between the recommender system's prediction and the correct answer. The closer the RMSE and MAE values are to 0, the better accuracy the recommender system has. RMSE disproportionately penalizes large errors, while MAE does not mirror many small errors properly, so both measurements should be used when evaluating the accuracy [7, 8, 9]. To provide test data for evaluation, a dataset is divided into two parts: one part is used for building the similarity matrix and the other part is used for evaluation.

Sparse data problem

Sparse data is a common problem in recommender systems where the dataset consists of few ratings compared to the number of users. This issue was simulated by splitting the dataset into two asymmetric parts. The smaller part is then used to make predictions for all objects in the larger part [10].

1.2 Datasets

Three datasets were used in this study. These are all datasets involving user ratings of movies, and they have all been previously used in studies about recommender systems [10]. The datasets are:

FilmTrust

FilmTrust was an old film rating website that has now been shut down. The data was crawled from the FilmTrust website in June 2011 as part of a research paper on recommender systems [11]. The FilmTrust database has users and items. There is a total of ratings where the scale goes from 1 to 5.

CiaoDVD

CiaoDVD was a DVD rating website where users could share their reviews of movies and give recommendations for stores with the best prices.
The data was crawled from dvd.ciao.co.uk in December 2013 as part of a research paper on trust prediction [12]. The
CiaoDVD database has 920 users and items. There is a total of ratings and the scale goes from 1 to 5.

MovieLens

MovieLens is a well-known dataset used in many scientific papers. It consists of a collection of movie ratings from the MovieLens web site. The dataset was collected over various periods of time [13]. The MovieLens database has users and items. There is a total of ratings and the scale goes from 1 to 5. In this dataset, all users have rated at least 20 items.

1.3 Surprise

There are multiple free implementations of recommender systems available. The algorithms in this study were implemented using the Python library Surprise [14]. Surprise is licensed under the BSD 3-Clause license [15].

1.4 Purpose

The study compares how well the two collaborative filtering methods, user-based and item-based, perform when predictions are based on sparse data, known as the sparse data problem. The sparse data problem is a common one in the field of machine learning [16], and understanding how effective these different methods are is of great value for future implementations.

1.5 Research question

How do the two filtering methods, user-based and item-based, compare when making predictions based on sparse data?

1.6 Scope and constraints

The datasets that were used are from MovieLens, FilmTrust and CiaoDVD. The Python library Surprise was used to conduct all tests. This study only compares the correctness of predictions when these are based on sparse data. Other factors such as speed and memory efficiency are not taken into consideration. The correctness is measured using the RMSE and MAE.
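As a recap of the similarity measures introduced in this chapter, the sketch below computes the Manhattan distance, Euclidean distance, cosine similarity and Pearson correlation for two toy rating vectors using NumPy. The rating values are illustrative, not taken from the figures or datasets above.

```python
import numpy as np

# Two users' ratings of the same two items (toy values)
x = np.array([2.0, 5.0])
y = np.array([1.0, 4.0])

manhattan = np.sum(np.abs(x - y))            # sum of per-axis differences
euclidean = np.sqrt(np.sum((x - y) ** 2))    # "straight line" distance
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Pearson correlation is the cosine similarity of the mean-centred vectors
xc, yc = x - x.mean(), y - y.mean()
pearson = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(manhattan, euclidean, round(cosine, 3), round(pearson, 3))
# manhattan → 2.0; pearson → 1.0 (identical ordering after mean-centring)
```

Note how the Pearson correlation is 1.0 even though the raw ratings differ: this is the "grade inflation" correction described earlier, since one user simply rates everything one point lower than the other.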
Chapter 2 Method

Running the two filtering methods, user-based and item-based filtering, on a dataset is henceforth referred to as a "test". Every test was conducted 10 times with randomized sets of training and test data. The mean value of these 10 runs represents the result of a test.

2.1 Data handling

Before use, the data needed processing. The following methods were used to prepare the data for testing.

Simulating sparse data

In the study, sparse data is simulated by using 20% of the dataset for training and 80% for verification. This ratio has been used in similar studies [17].

Formatting data

The datasets provided by MovieLens and FilmTrust use a format that Surprise can handle natively. The dataset from CiaoDVD was formatted before use: the Python script in appendix B.3 was used to retrieve only the columns with user id, movie id and rating.

Creating test data

The data was split using a Python script, see appendix B.2, that first read all the data from file into an array. The array was then shuffled by providing a seed value, ranging from 1 to 10, to the shuffle function in Python's random library. After that, every fifth rating (20%) was written to one file and the rest was written to another. The smaller file was then used as training data for the recommender system and the bigger file was used as test data. This was repeated 10 times with different seeds for each dataset.

2.2 Conducting the tests

The created test and training datasets were used to build models, run the prediction algorithm and evaluate the result. See appendix B.1 for code.
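The seeded 20/80 split described in the data handling section can be sketched as follows. The ratings list here is a placeholder of (user, item, rating) triples; the real script reads the dataset files and writes two output files instead.

```python
import random

# Placeholder ratings: (user, item, rating) triples, not the study's data
ratings = [(u, i, r) for u in range(10) for i, r in [(1, 3.0), (2, 4.0)]]

def split(data, seed):
    """Shuffle with a fixed seed, then take every fifth rating (20%) as
    training data and the remaining 80% as test data."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[::5]                                   # 20% -> training
    test = [x for i, x in enumerate(shuffled) if i % 5]     # 80% -> testing
    return train, test

train, test = split(ratings, seed=1)
print(len(train), len(test))   # → 4 16
```

Using a fixed seed per run makes each of the 10 partitions reproducible, which is what allows the study's tests to be repeated with seeds 1 through 10.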
Building similarity model

A PCC and a cosine similarity model were built for each dataset. Note that the models had to be created for each dataset and only one model could be evaluated in each run. This was configured with built-in functions in the Surprise library.

Building the prediction algorithm

Built-in methods in Surprise were used to create the prediction algorithm. Table 2.1 shows the configurations for the different prediction algorithms. All setups used a minimum of 1 neighbour for predictions.

Test | Filtering method | Similarity model | Max neighbours used
  1  | Item-based       | cosine           | 40
  2  | User-based       | cosine           | 40
  3  | Item-based       | Pearson          | 40
  4  | User-based       | Pearson          | 40

Table 2.1: Configurations for prediction algorithms

Evaluating the algorithms

Evaluation of the algorithms was done with the built-in function evaluate() in the Surprise library. Each test was run with all 10 test and training data combinations for each dataset. For both similarity models (PCC and cosine similarity) and each dataset, a mean value for the RMSE and MAE score was calculated based on the evaluation of the 10 differently seeded partitions of the data. An average was used to prevent strong influence from deviating scores in the case of bad data in the results.
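A sketch of the evaluation step: RMSE (equation 1.3) and MAE (equation 1.4) are computed per seeded run and then averaged over the runs. The per-run error lists below are toy values, not results from the study.

```python
import math

def rmse(errors):
    """Root Mean Squared Error, equation (1.3)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean Absolute Error, equation (1.4)."""
    return sum(abs(e) for e in errors) / len(errors)

runs = [                 # one list of (prediction - truth) errors per run
    [0.5, -0.5, 1.0],
    [0.2, -0.3, 0.4],
]
mean_rmse = sum(rmse(r) for r in runs) / len(runs)
mean_mae = sum(mae(r) for r in runs) / len(runs)
print(round(mean_rmse, 3), round(mean_mae, 3))
```

As noted in chapter 1, MAE can never exceed RMSE for the same errors, so the mean MAE here is necessarily the smaller of the two values.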
Chapter 3 Results

The following structure is used to present the results of the study. Two sections show the results for each of the similarity matrix structures, the Pearson correlation coefficient (Pearson) and cosine similarity (Cosine). For all datasets, user- and item-based filtering are compared side by side in a plot for each metric, MAE or RMSE. Each plot shows the average value of the 10 test runs; the lower the value, the better the predictions. Following the plot of average scores there is another plot which shows the max deviation of the scores. This is the difference between the highest and lowest score of the 10 test runs for each dataset and filtering method. The lower the difference, the smaller the spread observed between different test runs. This plot is included to give an idea of how much the tests varied, which is relevant as we use an average value. The full metrics of the tests are presented in appendix A.

3.1 Pearson

The following results were obtained using the Pearson method for the similarity matrix.

Figure 3.1: MAE, Pearson
The plot in figure 3.1 shows the results for the MAE scores. The plot shows a small advantage for item-based filtering for the FilmTrust dataset, while there is an opposite advantage for the MovieLens dataset. For the CiaoDVD dataset, user- and item-based filtering score about the same.

Figure 3.2: Max MAE score deviation for Pearson

The difference plot in figure 3.2 shows that the difference between the max and min values is small for all the datasets. FilmTrust has the highest value for user-based filtering, where the scores have a deviation of around 3%. The plot also shows that there is a big difference between the user- and item-based deviation for FilmTrust.

Figure 3.3: RMSE, Pearson

The RMSE scores, plotted in figure 3.3, hint at the same trends as the MAE scores. The FilmTrust dataset had better accuracy when item-based filtering was used
and MovieLens had better accuracy when user-based filtering was used. CiaoDVD had about the same accuracy for both filtering methods.

Figure 3.4: Max RMSE score deviation for Pearson

The difference plot in figure 3.4 shows the same max deviation for the FilmTrust dataset, with a small difference between the max and min values. The difference between the user- and item-based approaches for the FilmTrust dataset, which was observed in figure 3.2, is present here as well.
3.2 Cosine

The following results were obtained using the cosine similarity method for the similarity matrix.

Figure 3.5: MAE, Cosine

In figure 3.5, the same trend which was observed for the Pearson matrices in figure 3.1 is still visible. However, user- and item-based filtering scored slightly closer to each other.

Figure 3.6: Max MAE score deviation for cosine

For the cosine similarity matrix, the differences between the max and min scores are much smaller than for the Pearson similarity matrices. From figure 3.6 we see that the max score deviation is less than 0.01 points. However, there is a slightly smaller deviation for
item-based filtering for all datasets. Notice that the big deviation for user-based filtering for the FilmTrust dataset, which was observed when using the Pearson method, is not present here.

Figure 3.7: RMSE, Cosine

The RMSE score using the cosine similarity matrix, plotted in figure 3.7, shows the same trends as the RMSE score for the Pearson similarity matrix in figure 3.3.

Figure 3.8: Max RMSE score deviation for cosine

As opposed to the MAE score, we see a slightly smaller deviation of the scores for user-based filtering. The deviation is less than 0.01 points, which is very low.
Chapter 4 Discussion

The discussion is divided into three parts: one part discussing our results and how the study was conducted, one part on external dependencies, and a last part analysing the current state of the art and the relevancy of the study.

The figures show a clear pattern where neither user- nor item-based filtering has a clear advantage over the other, independent of error and correlation measurements (MAE, RMSE and Pearson, cosine). The results suggest that the choice of filtering method should be based on the data set. Exactly what properties of the data set one should look for when determining the filtering method is hard to say based on this study, as it only contains 3 datasets with several differences between them (making it hard to pinpoint determining factors).

Our experiments show a clear correlation between the two error measurements, where both give the same result for every dataset on which filtering method performed best. The MAE scores being lower than the respective RMSE ones across the board is expected, as MAE can never produce a higher value than RMSE, only an equal one (if all errors have the same magnitude).

The maximum k value for the k-nearest-neighbours algorithm, which denotes how many items or users the recommendations are based on, was chosen to be 40 in all tests. Choosing the optimal k value is not a simple task and there are many suggestions for how one should go about doing it, but no agreed-upon best method [18]. Using cross validation with different k values and comparing results is one recommended method, but this approach depends on the data set. Since different data sets are used in this study, different k values might be needed for the datasets to enable the system to perform at optimal capacity. Other ways of calculating an optimal k value are discussed in [19].
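The cross-validation idea mentioned above can be sketched as a simple grid search over candidate k values: score each k on held-out data and keep the one with the lowest error. The data and the scoring model here are toy stand-ins, not the study's actual Surprise setup.

```python
import random

# Toy data: (similarity-to-target-user, rating) pairs, randomly generated
random.seed(1)
data = [(random.uniform(0, 1), random.uniform(1, 5)) for _ in range(200)]

def mae_for_k(k, train, held_out):
    """Toy stand-in for the recommender: predict the mean rating of the k
    most similar training points, then report MAE over held-out ratings."""
    top = sorted(train, reverse=True)[:k]          # k highest similarities
    pred = sum(r for _, r in top) / len(top)
    return sum(abs(pred - r) for _, r in held_out) / len(held_out)

train, held_out = data[:160], data[160:]           # simple single fold
candidates = [5, 10, 20, 40]
best_k = min(candidates, key=lambda k: mae_for_k(k, train, held_out))
print("best k:", best_k)
```

A real run would repeat this over several folds and use the full kNN recommender as the scored model, which is exactly why the procedure is dataset-dependent and was left out of the study's scope.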
Calculating an optimal k value for every data set was considered outside of this study's scope, and the default value of the Surprise library (40) was used instead. This value is, as stated, the maximum number of neighbours which the algorithm will consider. If there are not 40 users (or items) which are similar enough to be considered neighbours, Surprise will use a lower amount (to a minimum of 1). Using a different maximum k value may have an impact on the results if this study's experiments are remade.

Every test result is a mean of 10 runs where the training and test data sets were randomized. This method was used because it was a fair compromise between correctness and the scope of the study. One can naturally get a more statistically sound value by averaging 1000 test runs instead of 10, but running the tests is
time consuming (computationally), and it is hard to set a limit for how many data points are needed for a fair assessment. One more thing which our method doesn't account for is outliers, which can skew the mean considerably. However, running each test only 10 times allowed us to see that no big statistical outliers were present in the mean calculations. This is shown in figures 3.2, 3.4, 3.6 and 3.8.

4.1 External dependencies

Two of the datasets, FilmTrust and CiaoDVD, were acquired from a scientific paper and not taken directly from their respective sources. They were both collected by crawling the websites while these were online (they have been shut down at the time of writing). This makes it hard to control the correctness of the data. The dataset from CiaoDVD came in a format incompatible with the Python program, so the data had to be processed and formatted, which leaves room for human error. An important attribute of the MovieLens dataset is that all users have made at least 20 ratings. There are no known similar minimum thresholds for the other datasets. To raise the confidence of the drawn conclusions, more datasets should be used, of varying sizes and from areas other than movie ratings. Initially the paper included a dataset from Yelp of restaurant reviews, but because of its different data format and time restrictions, this dataset could not be used in this study.

We have no reason to doubt the Surprise software. All our tests have returned reasonable results, and Surprise looks like a professionally built product for all intents and purposes. It is open source, actively maintained (the latest commit was within 24 hours of writing), well documented and written by a Ph.D. student at IRIT (Toulouse Institute of Computer Science Research). To confirm the accuracy of the software, one can use the same data sets and algorithms of this study, input these into another working recommender system and check if the results are identical.
4.2 State of the art and relevancy

Many companies use recommender systems today; some bigger ones are Amazon, Facebook, LinkedIn and YouTube. Finding out exactly what algorithms these companies use and how they are implemented has proven very difficult. There are two major reasons for this. One is that such information is part of their (often) closed source code. The other is that there is no simple answer to the question, as most modern recommender systems are based on a plethora of algorithms. One famous case where this was displayed was the Netflix Prize, a contest for developing a better recommender system for Netflix with a prize pool of a million dollars [20]. The best (winning) algorithms were in fact never implemented by Netflix, as their huge complexity and the engineering effort required overshadowed the slightly better predictions they would bring [21].

The relevancy of the study can be questioned since its scope is quite narrow. Limiting itself to only comparing the accuracy of the two methods and dismissing other factors, such as memory efficiency and computational demand/speed, may make the results irrelevant if one of the methods can't ever be feasibly applied because of such limitations. However, even if such limitations do exist, this and similar studies could provide valuable insight into whether pursuing a solution to such limitations is worth the effort.
More informationData can be in the form of numbers, words, measurements, observations or even just descriptions of things.
+ What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and
More informationStatistics can best be defined as a collection and analysis of numerical information.
Statistical Graphs There are many ways to organize data pictorially using statistical graphs. There are line graphs, stem and leaf plots, frequency tables, histograms, bar graphs, pictographs, circle graphs
More informationCollaborative Filtering using Euclidean Distance in Recommendation Engine
Indian Journal of Science and Technology, Vol 9(37), DOI: 10.17485/ijst/2016/v9i37/102074, October 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Collaborative Filtering using Euclidean Distance
More informationStatistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte
Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,
More informationCollaborative Filtering using a Spreading Activation Approach
Collaborative Filtering using a Spreading Activation Approach Josephine Griffith *, Colm O Riordan *, Humphrey Sorensen ** * Department of Information Technology, NUI, Galway ** Computer Science Department,
More informationSTA 570 Spring Lecture 5 Tuesday, Feb 1
STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row
More informationGLOSSARY OF TERMS. Commutative property. Numbers can be added or multiplied in either order. For example, = ; 3 x 8 = 8 x 3.
GLOSSARY OF TERMS Algorithm. An established step-by-step procedure used 1 to achieve a desired result. For example, the 55 addition algorithm for the sum of two two-digit + 27 numbers where carrying is
More informationUsing Excel for Graphical Analysis of Data
Using Excel for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters. Graphs are
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #7: Recommendation Content based & Collaborative Filtering Seoul National University In This Lecture Understand the motivation and the problem of recommendation Compare
More informationCS 124/LINGUIST 180 From Languages to Information
CS /LINGUIST 80 From Languages to Information Dan Jurafsky Stanford University Recommender Systems & Collaborative Filtering Slides adapted from Jure Leskovec Recommender Systems Customer X Buys CD of
More informationCPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017
CPSC 340: Machine Learning and Data Mining Kernel Trick Fall 2017 Admin Assignment 3: Due Friday. Midterm: Can view your exam during instructor office hours or after class this week. Digression: the other
More informationUsing a percent or a letter grade allows us a very easy way to analyze our performance. Not a big deal, just something we do regularly.
GRAPHING We have used statistics all our lives, what we intend to do now is formalize that knowledge. Statistics can best be defined as a collection and analysis of numerical information. Often times we
More informationCS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering
W10.B.0.0 CS435 Introduction to Big Data W10.B.1 FAQs Term project 5:00PM March 29, 2018 PA2 Recitation: Friday PART 1. LARGE SCALE DATA AALYTICS 4. RECOMMEDATIO SYSTEMS 5. EVALUATIO AD VALIDATIO TECHIQUES
More informationES-2 Lecture: Fitting models to data
ES-2 Lecture: Fitting models to data Outline Motivation: why fit models to data? Special case (exact solution): # unknowns in model =# datapoints Typical case (approximate solution): # unknowns in model
More informationAverages and Variation
Averages and Variation 3 Copyright Cengage Learning. All rights reserved. 3.1-1 Section 3.1 Measures of Central Tendency: Mode, Median, and Mean Copyright Cengage Learning. All rights reserved. 3.1-2 Focus
More informationCISC 4631 Data Mining
CISC 4631 Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F.
More informationGlossary Common Core Curriculum Maps Math/Grade 6 Grade 8
Glossary Common Core Curriculum Maps Math/Grade 6 Grade 8 Grade 6 Grade 8 absolute value Distance of a number (x) from zero on a number line. Because absolute value represents distance, the absolute value
More informationSimilarity and recommender systems
Similarity and recommender systems Andreas C. Kapourani January 8 Introduction In this lab session we will work with some toy data and implement a simple collaborative filtering recommender system (RS),
More informationVCEasy VISUAL FURTHER MATHS. Overview
VCEasy VISUAL FURTHER MATHS Overview This booklet is a visual overview of the knowledge required for the VCE Year 12 Further Maths examination.! This booklet does not replace any existing resources that
More informationBuilding Better Parametric Cost Models
Building Better Parametric Cost Models Based on the PMI PMBOK Guide Fourth Edition 37 IPDI has been reviewed and approved as a provider of project management training by the Project Management Institute
More informationTowards a hybrid approach to Netflix Challenge
Towards a hybrid approach to Netflix Challenge Abhishek Gupta, Abhijeet Mohapatra, Tejaswi Tenneti March 12, 2009 1 Introduction Today Recommendation systems [3] have become indispensible because of the
More informationBar Graphs and Dot Plots
CONDENSED LESSON 1.1 Bar Graphs and Dot Plots In this lesson you will interpret and create a variety of graphs find some summary values for a data set draw conclusions about a data set based on graphs
More informationChallenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track
Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde
More informationBy Atul S. Kulkarni Graduate Student, University of Minnesota Duluth. Under The Guidance of Dr. Richard Maclin
By Atul S. Kulkarni Graduate Student, University of Minnesota Duluth Under The Guidance of Dr. Richard Maclin Outline Problem Statement Background Proposed Solution Experiments & Results Related Work Future
More informationGraphical Analysis of Data using Microsoft Excel [2016 Version]
Graphical Analysis of Data using Microsoft Excel [2016 Version] Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters.
More informationCCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12
Tool 1: Standards for Mathematical ent: Interpreting Functions CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Name of Reviewer School/District Date Name of Curriculum Materials:
More informationImage Compression With Haar Discrete Wavelet Transform
Image Compression With Haar Discrete Wavelet Transform Cory Cox ME 535: Computational Techniques in Mech. Eng. Figure 1 : An example of the 2D discrete wavelet transform that is used in JPEG2000. Source:
More informationA Recommender System Based on Improvised K- Means Clustering Algorithm
A Recommender System Based on Improvised K- Means Clustering Algorithm Shivani Sharma Department of Computer Science and Applications, Kurukshetra University, Kurukshetra Shivanigaur83@yahoo.com Abstract:
More informationTechnical Arts 101 Prof. Anupam Saxena Department of Mechanical engineering Indian Institute of Technology, Kanpur. Lecture - 7 Think and Analyze
Technical Arts 101 Prof. Anupam Saxena Department of Mechanical engineering Indian Institute of Technology, Kanpur Lecture - 7 Think and Analyze Last time I asked you to come up with a single funniest
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions
More informationData Mining. Lecture 03: Nearest Neighbor Learning
Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F. Provost
More informationSample some Pi Monte. Introduction. Creating the Simulation. Answers & Teacher Notes
Sample some Pi Monte Answers & Teacher Notes 7 8 9 10 11 12 TI-Nspire Investigation Student 45 min Introduction The Monte-Carlo technique uses probability to model or forecast scenarios. In this activity
More informationPreparing for AS Level Further Mathematics
Preparing for AS Level Further Mathematics Algebraic skills are incredibly important in the study of further mathematics at AS and A level. You should therefore make sure you are confident with all of
More informationCourse Outline for Grade 12 College Foundations MAP4C
Course Outline for Grade 12 College Foundations MAP4C UNIT 1 TRIGONOMETRY Pearson Pg. 8-12 #2-5 1.1, Introduction to Trigonometry, Primary Trig Ratios C3.1 solve problems in two dimensions using metric
More informationInternational Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System
Scientific Journal of Impact Factor (SJIF): 4.72 International Journal of Advance Engineering and Research Development Volume 4, Issue 3, March -2017 A Facebook Profile Based TV Shows and Movies Recommendation
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More informationImproving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm
Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Majid Hatami Faculty of Electrical and Computer Engineering University of Tabriz,
More informationStatistics 1 - Basic Commands. Basic Commands. Consider the data set: {15, 22, 32, 31, 52, 41, 11}
Statistics 1 - Basic Commands http://mathbits.com/mathbits/tisection/statistics1/basiccommands.htm Page 1 of 3 Entering Data: Basic Commands Consider the data set: {15, 22, 32, 31, 52, 41, 11} Data is
More informationCS 124/LINGUIST 180 From Languages to Information
CS /LINGUIST 80 From Languages to Information Dan Jurafsky Stanford University Recommender Systems & Collaborative Filtering Slides adapted from Jure Leskovec Recommender Systems Customer X Buys CD of
More information6 TOOLS FOR A COMPLETE MARKETING WORKFLOW
6 S FOR A COMPLETE MARKETING WORKFLOW 01 6 S FOR A COMPLETE MARKETING WORKFLOW FROM ALEXA DIFFICULTY DIFFICULTY MATRIX OVERLAP 6 S FOR A COMPLETE MARKETING WORKFLOW 02 INTRODUCTION Marketers use countless
More informationSurvey of Math: Excel Spreadsheet Guide (for Excel 2016) Page 1 of 9
Survey of Math: Excel Spreadsheet Guide (for Excel 2016) Page 1 of 9 Contents 1 Introduction to Using Excel Spreadsheets 2 1.1 A Serious Note About Data Security.................................... 2 1.2
More informationHierarchical Clustering
What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering
More informationPersonalized Web Search
Personalized Web Search Dhanraj Mavilodan (dhanrajm@stanford.edu), Kapil Jaisinghani (kjaising@stanford.edu), Radhika Bansal (radhika3@stanford.edu) Abstract: With the increase in the diversity of contents
More informationCOMP 465: Data Mining Recommender Systems
//0 movies COMP 6: Data Mining Recommender Systems Slides Adapted From: www.mmds.org (Mining Massive Datasets) movies Compare predictions with known ratings (test set T)????? Test Data Set Root-mean-square
More informationLagrange Multipliers and Problem Formulation
Lagrange Multipliers and Problem Formulation Steven J. Miller Department of Mathematics and Statistics Williams College Williamstown, MA 01267 Abstract The method of Lagrange Multipliers (and its generalizations)
More informationCS 124/LINGUIST 180 From Languages to Information
CS /LINGUIST 80 From Languages to Information Dan Jurafsky Stanford University Recommender Systems & Collaborative Filtering Slides adapted from Jure Leskovec Recommender Systems Customer X Buys Metallica
More informationCity, University of London Institutional Repository. This version of the publication may differ from the final published version.
City Research Online City, University of London Institutional Repository Citation: Überall, Christian (2012). A dynamic multi-algorithm collaborative-filtering system. (Unpublished Doctoral thesis, City
More informationSmarter Balanced Vocabulary (from the SBAC test/item specifications)
Example: Smarter Balanced Vocabulary (from the SBAC test/item specifications) Notes: Most terms area used in multiple grade levels. You should look at your grade level and all of the previous grade levels.
More informationAlgebra 2 Chapter Relations and Functions
Algebra 2 Chapter 2 2.1 Relations and Functions 2.1 Relations and Functions / 2.2 Direct Variation A: Relations What is a relation? A of items from two sets: A set of values and a set of values. What does
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining Fundamentals of learning (continued) and the k-nearest neighbours classifier Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.
More informationI can solve simultaneous equations algebraically, where one is quadratic and one is linear.
A* I can manipulate algebraic fractions. I can use the equation of a circle. simultaneous equations algebraically, where one is quadratic and one is linear. I can transform graphs, including trig graphs.
More informationMiddle School Math Course 3
Middle School Math Course 3 Correlation of the ALEKS course Middle School Math Course 3 to the Texas Essential Knowledge and Skills (TEKS) for Mathematics Grade 8 (2012) (1) Mathematical process standards.
More informationRecommender Systems 6CCS3WSN-7CCSMWAL
Recommender Systems 6CCS3WSN-7CCSMWAL http://insidebigdata.com/wp-content/uploads/2014/06/humorrecommender.jpg Some basic methods of recommendation Recommend popular items Collaborative Filtering Item-to-Item:
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationMastery. PRECALCULUS Student Learning Targets
PRECALCULUS Student Learning Targets Big Idea: Sequences and Series 1. I can describe a sequence as a function where the domain is the set of natural numbers. Connections (Pictures, Vocabulary, Definitions,
More informationCorrelation. January 12, 2019
Correlation January 12, 2019 Contents Correlations The Scattterplot The Pearson correlation The computational raw-score formula Survey data Fun facts about r Sensitivity to outliers Spearman rank-order
More informationData Mining. ❷Chapter 2 Basic Statistics. Asso.Prof.Dr. Xiao-dong Zhu. Business School, University of Shanghai for Science & Technology
❷Chapter 2 Basic Statistics Business School, University of Shanghai for Science & Technology 2016-2017 2nd Semester, Spring2017 Contents of chapter 1 1 recording data using computers 2 3 4 5 6 some famous
More informationCPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017
CPSC 340: Machine Learning and Data Mining Recommender Systems Fall 2017 Assignment 4: Admin Due tonight, 1 late day for Monday, 2 late days for Wednesday. Assignment 5: Posted, due Monday of last week
More informationFractions. 7th Grade Math. Review of 6th Grade. Slide 1 / 306 Slide 2 / 306. Slide 4 / 306. Slide 3 / 306. Slide 5 / 306.
Slide 1 / 06 Slide 2 / 06 7th Grade Math Review of 6th Grade 2015-01-14 www.njctl.org Slide / 06 Table of Contents Click on the topic to go to that section Slide 4 / 06 Fractions Decimal Computation Statistics
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu /2/8 Jure Leskovec, Stanford CS246: Mining Massive Datasets 2 Task: Given a large number (N in the millions or
More informationRecommender Systems using Collaborative Filtering D Yogendra Rao
Recommender Systems using Collaborative Filtering D Yogendra Rao Department of Computer Science and Engineering National Institute of Technology Rourkela Rourkela 769 008, India Recommender Systems using
More informationPart 12: Advanced Topics in Collaborative Filtering. Francesco Ricci
Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules
More informationTips and Guidance for Analyzing Data. Executive Summary
Tips and Guidance for Analyzing Data Executive Summary This document has information and suggestions about three things: 1) how to quickly do a preliminary analysis of time-series data; 2) key things to
More informationData Mining Techniques
Data Mining Techniques CS 60 - Section - Fall 06 Lecture Jan-Willem van de Meent (credit: Andrew Ng, Alex Smola, Yehuda Koren, Stanford CS6) Recommender Systems The Long Tail (from: https://www.wired.com/00/0/tail/)
More informationDemystifying movie ratings 224W Project Report. Amritha Raghunath Vignesh Ganapathi Subramanian
Demystifying movie ratings 224W Project Report Amritha Raghunath (amrithar@stanford.edu) Vignesh Ganapathi Subramanian (vigansub@stanford.edu) 9 December, 2014 Introduction The past decade or so has seen
More informationLines of Symmetry. Grade 3. Amy Hahn. Education 334: MW 8 9:20 a.m.
Lines of Symmetry Grade 3 Amy Hahn Education 334: MW 8 9:20 a.m. GRADE 3 V. SPATIAL SENSE, GEOMETRY AND MEASUREMENT A. Spatial Sense Understand the concept of reflection symmetry as applied to geometric
More information