Agenda. Collabora-ve Filtering (CF)

Size: px

Start display at page:

Download "Agenda. Collabora-ve Filtering (CF)"

Loren Marsh
5 years ago
Views:

1 - 1 -

2 Agenda Collabora-ve Filtering (CF) Pure CF approaches User-based nearest-neighbor The Pearson Correla9on similarity measure Memory-based and model-based approaches Item-based nearest-neighbor The cosine similarity measure Data sparsity problems Recent methods (SVD, Associa9on Rule Mining, Slope One, RF-Rec, ) The Google News personaliza9on engine Discussion and summary Literature - 2 -

3 Collabora-ve Filtering (CF) The most prominent approach to generate recommenda-ons used by large, commercial e-commerce sites well-understood, various algorithms and varia9ons exist applicable in many domains (book, movies, DVDs,..) Approach use the "wisdom of the crowd" to recommend items Basic assump-on and idea Users give ra9ngs to catalog items (implicitly or explicitly) Customers who had similar tastes in the past, will have similar tastes in the future - 3 -

4 Pure CF Approaches Input Only a matrix of given user item ra9ngs Output types A (numerical) predic9on indica9ng to what degree the current user will like or dislike a certain item A top-n list of recommended items - 4 -

5 User-based nearest-neighbor collabora-ve filtering (1) - 5 -

6 User-based nearest-neighbor collabora-ve filtering (2) Example A database of ra9ngs of the current user, Alice, and some other users is given: Item1 Item2 Item3 Item4 Item5 Alice ? User User User User Determine whether Alice will like or dislike Item5, which Alice has not yet rated or seen - 6 -

7 User-based nearest-neighbor collabora-ve filtering (3) Some first ques-ons How do we measure similarity? How many neighbors should we consider? How do we generate a predic9on from the neighbors' ra9ngs? Item1 Item2 Item3 Item4 Item5 Alice ? User User User User

8 Measuring user similarity (1) - 8 -

9 Measuring user similarity (2) Item1 Item2 Item3 Item4 Item5 Alice ? User User User User sim = 0.85 sim = 0.71 sim = 0.00 sim = Assumes that you don t include item5 in average - 9 -

10 Pearson correla-on Takes differences in ra-ng behavior into account 6 Alice 5 4 User1 User4 Ratings Item1 Item2 Item3 Item4 Works well in usual domains, compared with alterna-ve measures such as cosine similarity

11 Making predic-ons Only include reasonably similar users (>0). Also, average here includes ALL reasonably similar users ratings (not the same average as before)

12 Improving the metrics / predic-on func-on Not all neighbor ra-ngs might be equally "valuable" Agreement on commonly liked items is not so informa9ve as agreement on controversial items Possible solu-on: Give more weight to items that have a higher variance Value of number of co-rated items Use "significance weigh9ng", by e.g., linearly reducing the weight when the number of co-rated items is low Case amplifica-on Intui9on: Give more weight to "very similar" neighbors, i.e., where the similarity value is close to 1. Neighborhood selec-on Use similarity threshold or fixed number of neighbors

13 Memory-based and model-based approaches User-based CF is said to be "memory-based" the ra9ng matrix is directly used to find neighbors / make predic9ons does not scale for most real-world scenarios large e-commerce sites have tens of millions of customers and millions of items Model-based approaches based on an offline pre-processing or "model-learning" phase at run-9me, only the learned model is used to make predic9ons models are updated / re-trained periodically large variety of techniques used model-building and upda9ng can be computa9onally expensive item-based CF is an example for model-based approaches

14 Item-based collabora-ve filtering Basic idea: Use the similarity between items (and not users) to make predic9ons Example: Look for items that are similar to Item5 Take Alice's ra9ngs for these items to predict the ra9ng for Item5 Item1 Item2 Item3 Item4 Item5 Alice ? User User User User

15 The cosine similarity measure

Making predic-ons A common predic9on func9on: Neighborhood size is typically also limited to a specific size Not all neighbors are taken into account for the predic9on

16 Making predic-ons A common predic9on func9on: Neighborhood size is typically also limited to a specific size Not all neighbors are taken into account for the predic9on An analysis of the MovieLens dataset indicates that "in most real-world situa9ons, a neighborhood of 20 to 50 neighbors seems reasonable" (Herlocker et al. 2002)

17 Pre-processing for item-based filtering Item-based filtering does not solve the scalability problem itself Pre-processing approach by Amazon.com (in 2003) Calculate all pair-wise item similari9es in advance The neighborhood to be used at run-9me is typically rather small, because only items are taken into account which the user has rated Item similari9es are supposed to be more stable than user similari9es Memory requirements Up to N 2 pair-wise similari9es to be memorized (N = number of items) in theory In prac9ce, this is significantly lower (items with no co-ra9ngs) Further reduc9ons possible Minimum threshold for co-ra9ngs Limit the neighborhood size (might affect recommenda9on accuracy)

18 More on ra-ngs Explicit ra-ngs Probably the most precise ra9ngs Most commonly used (1 to 5, 1 to 7 Likert response scales) Research topics Op9mal granularity of scale; indica9on that 10-point scale is beger accepted in movie dom. An even more fine-grained scale was chosen in the joke recommender discussed by Goldberg et al. (2001), where a con9nuous scale (from 10 to +10) and a graphical input bar were used No precision loss from the discre9za9on User preferences can be captured at a finer granularity Users actually "like" the graphical interac9on method Mul9dimensional ra9ngs (mul9ple ra9ngs per movie such as ra9ngs for actors and sound) Main problems Users not always willing to rate many items number of available ra9ngs could be too small sparse ra9ng matrices poor recommenda9on quality How to s9mulate users to rate more items?

19 More on ra-ngs Implicit ra-ngs Typically collected by the web shop or applica9on in which the recommender system is embedded When a customer buys an item, for instance, many recommender systems interpret this behavior as a posi9ve ra9ng Clicks, page views, 9me spent on some page, demo downloads Implicit ra9ngs can be collected constantly and do not require addi9onal efforts from the side of the user Main problem One cannot be sure whether the user behavior is correctly interpreted For example, a user might not like all the books he or she has bought; the user also might have bought a book for someone else Implicit ra9ngs can be used in addi9on to explicit ones; ques9on of correctness of interpreta9on

20 Data sparsity problems Cold start problem How to recommend new items? What to recommend to new users? Straigh]orward approaches Ask/force users to rate a set of items Use another method (e.g., content-based, demographic or simply nonpersonalized) in the ini9al phase Default vo9ng: assign default values to items that only one of the two users to be compared has rated (Breese et al. 1998) Alterna-ves Use beger algorithms (beyond nearest-neighbor approaches) Example: In nearest-neighbor approaches, the set of sufficiently similar neighbors might be too small to make good predic9ons Assume "transi9vity" of neighborhoods

Example algorithms for sparse datasets Item1 Item2 Item3 Item4 Item5 Alice 5 3 4 4?

21 Example algorithms for sparse datasets Item1 Item2 Item3 Item4 Item5 Alice ? User ? User User User sim = 0.85 Predict ra9ng for User1-21 -

22 Graph-based methods (1) "Spreading ac-va-on" (Huang et al. 2004) Exploit the supposed "transi9vity" of customer tastes and thereby augment the matrix with addi9onal informa9on Assume that we are looking for a recommenda9on for User1 When using a standard CF approach, User2 will be considered a peer for User1 because they both bought Item2 and Item4 Thus Item3 will be recommended to User1 because the nearest neighbor, User2, also bought or liked it

23 Graph-based methods (2) "Spreading ac-va-on" (Huang et al. 2004) In a standard user-based or item-based CF approach, paths of length 3 will be considered that is, Item3 is relevant for User1 because there exists a three-step path (User1 Item2 User2 Item3) between them Because the number of such paths of length 3 is small in sparse ra9ng databases, the idea is to also consider longer paths (indirect associa9ons) to compute recommenda9ons Using path length 5, for instance

24 Graph-based methods (3) "Spreading ac-va-on" (Huang et al. 2004) Idea: Use paths of lengths > 3 to recommend items Length 3: Recommend Item3 to User1 Length 5: Item1 also recommendable

25 More model-based approaches Plethora of different techniques proposed in the last years, e.g., Matrix factoriza9on techniques, sta9s9cs singular value decomposi9on, principal component analysis Associa9on rule mining compare: shopping basket analysis Probabilis9c models clustering models, Bayesian networks, probabilis9c Latent Seman9c Analysis Various other machine learning approaches Costs of pre-processing Usually not discussed Incremental updates possible?

26 2000: Applica'on of Dimensionality Reduc'on in Recommender System, B. Sarwar et al., WebKDD Workshop Basic idea: Trade more complex offline model building for faster online predic-on genera-on Singular Value Decomposi-on for dimensionality reduc-on of ra-ng matrices Captures important factors/aspects and their weights in the data factors can be genre, actors but also non-understandable ones Assump9on that k dimensions capture the signals and filter out noise (K = 20 to 100) Constant -me to make recommenda-ons Approach also popular in IR (Latent Seman-c Indexing), data compression,

27 Matrix factoriza-on M = U Σ V T

28 Remember Alice? Item1 Item2 Item3 Item4 Item5 Alice ? User ? User User User SVD reduces dimensionality of problem. We argue that only important dimensions contribute to ratings; in this case 2 dimensions. Can perform SVD on this 4x4 matrix

29 Using JAMA: Interested in 1 st 2 Columns U Alice2D = Alice*U 2 *Σ 2-1 Alice = {5,3,4,4} Σ So, Alice2D = [0.64,-0.30] Can now use Alice2D to Predict other ratings. For example, use neighbors to predict ratings. V

30 What do the matrices represent? Matrix V corresponds to the users Matrix U corresponds to the products We are projec-ng a 4 D space to a 2D space We then weight influences by distance in this 2 D space

31 Example for SVD-based recommenda-on SVD: M = U Σ V k k k T k U k Dim1 Dim2 V T k Alice Dim Bob Dim Mary Sue Σ k Dim1 Dim2 Predic-on: rˆ ui = ru + U k ( Alice) Σk V = = 5.35 (on scale of 6) T k ( EPL) Dim Dim

32 1 0.8 Sue Terminator Twins 0.2 Eat Pray Love Bob Mary Alice Pretty Woman Die Hard

33 Discussion about dimensionality reduc-on (Sarwar et al. 2000a) Matrix factoriza-on Generate low-rank approxima9on of matrix Detec9on of latent factors Projec9ng items and users in the same n-dimensional space Predic-on quality can decrease because the original ra9ngs are not taken into account Predic-on quality can increase as a consequence of filtering out some "noise" in the data and detec9ng nontrivial correla9ons in the data Depends on the right choice of the amount of data reduc-on number of singular values in the SVD approach Parameters can be determined and fine-tuned only based on experiments in a certain domain Koren et al talk about 20 to 100 factors that are derived from the ra9ng pagerns

34 Associa-on rule mining

35 Recommenda-on based on Associa-on Rule Mining Simplest approach transform 5-point ra9ngs into binary ra9ngs (1 = above user average) Mine rules such as Item1 Item5 support (2/4), confidence (2/2) (without Alice) Make recommenda-ons for Alice (basic method) Determine "relevant" rules based on Alice's transac9ons (the above rule will be relevant as Alice bought Item1) Determine items not already bought by Alice Sort the items based on the rules' confidence values Different varia-ons possible dislike statements, user associa9ons.. Item1 Item2 Item3 Item4 Item5 Alice ? User User User User

36 Probabilis-c methods Only fair, but often works well in practice

Calcula-on of probabili-es in simplis-c approach Item1 Item2 Item3 Item4 Item5 Alice 1 3 3 2?

37 Calcula-on of probabili-es in simplis-c approach Item1 Item2 Item3 Item4 Item5 Alice ? User User User User X = (Item1 =1, Item2=3, Item3= ) More to consider Zeros (smoothing required) like/dislike simplifica9on possible P(item5=1 X)=P(X item5=1)xp(item5=1) P(X) = 0.125x2/4 =

38 Prac-cal probabilis-c approaches Expectation maximization

39 Slope One predictors (Lemire and Maclachlan 2005) Idea of Slope One predictors is simple and is based on a popularity differen'al between items for users Example: Item1 Item5 Alice 2? User p(alice, Item5) = 2 + ( 2-1 ) = 3 Basic scheme: Take the average of these differences of the co-ra-ngs to make the predic-on In general: Find a func-on of the form f(x) = x + b That is why the name is "Slope One"

40 Example (from Wikipedia) Customer Item A Item B Item C John Mark 3 4 DNR Lucy DNR 2 5 In this case, the average difference in ra9ngs between item B and A is (2+ (-1))/2=0.5. Hence, on average, item A is rated above item B by 0.5. Similarly, the average difference between item C and A is 3. Hence, if we agempt to predict the ra9ng of Lucy for item A using her ra9ng for item B, we get = 2.5. Similarly, if we try to predict her ra9ng for item A using her ra9ng of item C, we get 5+3=8. NOTE: DNR = did not rate

41 Example (from Wikipedia) Customer Item A Item B Item C John Mark 3 4 DNR Lucy DNR 2 5 Average Slope: ((5-3)+(3-4))/2 Slope C->A = 5-2 = 3 Slope B->A = 4-3 = -1 If a user rated several items, the predic9ons are simply combined using a weighted average where a good choice for the weight is the number of users having rated both items. In the above example, we would predict the following ra9ng for Lucy on item A: (2*2.5+1*8)/(2+1) = 13/3 = 4.33 Hence, given n items, to implement Slope One, all that is needed is to compute and store the average differences and the number of common ra9ngs for each of the n 2 pairs of items

42 RF-Rec predictors (Gedikli et al. 2011) Item1 Item2 Item3 Item4 Item5 Alice 1 1? 5 4 User User2 1 1 User User User

43 2008: Factoriza'on meets the neighborhood: a mul'faceted collabora've filtering model, Y. Koren, ACM SIGKDD S-mulated by work on Ne]lix compe--on Prize of $1,000,000 for accuracy improvement of 10% RMSE compared to own Cinematch system Very large dataset (~100M ra-ngs, ~480K users, ~18K movies) Last ra-ngs/user withheld (set K) Root mean squared error metric op-mized to Metrics measure error rate Mean Absolute Error (MAE) computes the devia-on between predicted ra-ngs and actual ra-ngs Root Mean Square Error (RMSE) is similar to MAE, but places more emphasis on larger devia-on

44 2008: Factoriza'on meets the neighborhood: a mul'faceted collabora've filtering model, Y. Koren, ACM SIGKDD Merges neighborhood models with latent factor models Latent factor models good to capture weak signals in the overall data Neighborhood models good at detec-ng strong rela-onships between close items Combina-on in one predic-on single func-on Local search method such as stochas9c gradient descent to determine parameters Add penalty for high values to avoid over-fixng r ˆ = µ + b + b + ui min p, q, b * * * ( u, i) K ( r ui u i µ b u b p i T u q p i T u q i ) 2 + λ( p u 2 + q i 2 + b 2 u + b 2 i )

45 Summarizing recent methods Recommenda-on is concerned with learning from noisy observa-ons (x,y), where f ( x) = yˆ 2 has to be determined such that ( yˆ y) yˆ is minimal. A huge variety of different learning strategies have been applied trying to es-mate f(x) Non parametric neighborhood models MF models, SVMs, Neural Networks, Bayesian Networks,

46 Collabora-ve Filtering Issues Pros: well-understood, works well in some domains, no knowledge engineering required Cons: requires user community, sparsity problems, no integra9on of other knowledge sources, no explana9on of results What is the best CF method? In which situa9on and which domain? Inconsistent findings; always the same domains and data sets; differences between methods are oyen very small (1/100) How to evaluate the predic-on quality? MAE / RMSE: What does an MAE of 0.7 actually mean? Serendipity (novelty and surprising effect of recommenda9ons) Not yet fully understood What about mul--dimensional ra-ngs?

47 The Google News personaliza-on engine

48 Google News portal (1) Aggregates news ar-cles from several thousand sources Displays them to signed-in users in a personalized way Collabora-ve recommenda-on approach based on the click history of the ac9ve user and the history of the larger community Main challenges Vast number of ar9cles and users Generate recommenda9on list in real 9me (at most one second) Constant stream of new items Immediately react to user interac9on Significant efforts with respect to algorithms, engineering, and paralleliza-on are required

49 Google News portal (2) Pure memory-based approaches are not directly applicable and for model-based approaches, the problem of con-nuous model updates must be solved A combina-on of model- and memory-based techniques is used Model-based part: Two clustering techniques are used Probabilis9c Latent Seman9c Indexing (PLSI) as proposed by (Hofmann 2004) MinHash as a hashing method Memory-based part: Analyze story co-visits for dealing with new users Google's MapReduce technique is used for paralleliza-on in order to make computa-on scalable

50 Useful Book: Guidetodatamining.com Chapter 3 is par-cularly useful. Has step-by-step instruc-ons on ra-ngs, user and item-based approaches. Slope-one and naïve Bayes also covered. Link: hvp://guidetodatamining.com

51 Literature (1) [Adomavicius and Tuzhilin 2005] Toward the next genera9on of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Transac9ons on Knowledge and Data Engineering 17 (2005), no. 6, [Breese et al. 1998] Empirical analysis of predic9ve algorithms for collabora9ve filtering, Proceedings of the 14th Conference on Uncertainty in Ar9ficial Intelligence (Madison, WI) (Gregory F. Cooper and Seraf in Moral, eds.), Morgan Kaufmann, 1998, pp [Gedikli et al. 2011] RF-Rec: Fast and accurate computa9on of recommenda9ons based on ra9ng frequencies, Proceedings of the 13th IEEE Conference on Commerce and Enterprise Compu9ng - CEC 2011, Luxembourg, 2011, forthcoming [Goldberg et al. 2001] Eigentaste: A constant 9me collabora9ve filtering algorithm, Informa9on Retrieval 4 (2001), no. 2, [Golub and Kahan 1965] Calcula9ng the singular values and pseudo-inverse of a matrix, Journal of the Society for Industrial and Applied Mathema9cs, Series B: Numerical Analysis 2 (1965), no. 2, [Herlocker et al. 2002] An empirical analysis of design choices in neighborhood-based collabora9ve filtering algorithms, Informa9on Retrieval 5 (2002), no. 4, [Herlocker et al. 2004] Evalua9ng collabora9ve filtering recommender systems, ACM Transac9ons on Informa9on Systems (TOIS) 22 (2004), no. 1,

52 Literature (2) [Hofmann 2004] Latent seman9c models for collabora9ve filtering, ACM Transac9ons on Informa9on Systems 22 (2004), no. 1, [Huang et al. 2004] Applying associa9ve retrieval techniques to alleviate the sparsity problem in collabora9ve filtering, ACM Transac9ons on Informa9on Systems 22 (2004), no. 1, [Koren et al. 2009] Matrix factoriza8on techniques for recommender systems, Computer 42 (2009), no. 8, [Lemire and Maclachlan 2005] Slope one predictors for online ra9ng-based collabora9ve filtering, Proceedings of the 5th SIAM Interna9onal Conference on Data Mining (SDM 05) (Newport Beach, CA), 2005, pp [Sarwar et al. 2000a] Applica9on of dimensionality reduc9on in recommender systems a case study, Proceedings of the ACM WebKDD Workshop (Boston), 2000 [Zhang and Pu 2007] A recursive predic9on algorithm for collabora9ve filtering recommender systems, Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys 07) (Minneapolis, MN), ACM, 2007, pp

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit