Recommender Systems 6CCS3WSN-7CCSMWAL

Similar documents
Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Title of Projects: - Implementing a simple Recommender System on user based rating and testing the same.

CS 124/LINGUIST 180 From Languages to Information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS 124/LINGUIST 180 From Languages to Information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

CS 124/LINGUIST 180 From Languages to Information

Part 11: Collaborative Filtering. Francesco Ricci

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Data Mining

Singular Value Decomposition, and Application to Recommender Systems

Recommendation Systems

Machine Learning using MapReduce

Part 11: Collaborative Filtering. Francesco Ricci

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

BBS654 Data Mining. Pinar Duygulu

Hands-On Exercise: Implementing a Basic Recommender

Hybrid Recommendation System Using Clustering and Collaborative Filtering

COMP6237 Data Mining Making Recommendations. Jonathon Hare

CSE 258 Lecture 8. Web Mining and Recommender Systems. Extensions of latent-factor models, (and more on the Netflix prize)

Recommender Systems (RSs)

CS224W Project: Recommendation System Models in Product Rating Predictions

Information Retrieval. Lecture 7

CSE 158 Lecture 8. Web Mining and Recommender Systems. Extensions of latent-factor models, (and more on the Netflix prize)

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1

Data Mining Lecture 2: Recommender Systems

Recommender Systems - Introduction. Data Mining Lecture 2: Recommender Systems

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Data Mining Classification: Alternative Techniques. Imbalanced Class Problem

List of Exercises: Data Mining 1 December 12th, 2015

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Weka ( )

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

THIS LECTURE. How do we know if our results are any good? Results summaries: Evaluating a search engine. Making our good results usable to a user

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

INTRODUCTION TO MACHINE LEARNING. Measuring model performance or error

Recommender Systems New Approaches with Netflix Dataset

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CCRMA MIR Workshop 2014 Evaluating Information Retrieval Systems. Leigh M. Smith Humtap Inc.

Web Information Retrieval. Exercises Evaluation in information retrieval

Part 7: Evaluation of IR Systems Francesco Ricci

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CptS 570 Machine Learning Project: Netflix Competition. Parisa Rashidi Vikramaditya Jakkula. Team: MLSurvivors. Wednesday, December 12, 2007

2. On classification and related tasks

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Towards a hybrid approach to Netflix Challenge

CPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017

Information Retrieval

Evaluating Classifiers

Chapter III.2: Basic ranking & evaluation measures

MovieRec - CS 410 Project Report

Representation of Documents and Infomation Retrieval

Chapter 6: Information Retrieval and Web Search. An introduction

Collaborative Filtering for Netflix

Information Retrieval. (M&S Ch 15)

Association Rules. CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University. Slides adapted from lectures by Jeff Ullman

Social Search Networks of People and Search Engines. CS6200 Information Retrieval

International Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System

Evaluating Classifiers

Demystifying movie ratings 224W Project Report. Amritha Raghunath Vignesh Ganapathi Subramanian

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

CS6322: Information Retrieval Sanda Harabagiu. Lecture 13: Evaluation

Information Retrieval

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

Evaluation of different biological data and computational classification methods for use in protein interaction prediction.

Classification Part 4

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 4. Prof. James She

5/13/2009. Introduction. Introduction. Introduction. Introduction. Introduction

CS4491/CS 7265 BIG DATA ANALYTICS

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Recommender Systems - Content, Collaborative, Hybrid

Using Social Networks to Improve Movie Rating Predictions

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Overview. Lecture 6: Evaluation. Summary: Ranked retrieval. Overview. Information Retrieval Computer Science Tripos Part II.

Information Retrieval

Mining Social Network Graphs

Machine Learning and Bioinformatics 機器學習與生物資訊學

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology

CSCI 5417 Information Retrieval Systems. Jim Martin!

Recommender system techniques applied to Netflix movie data

Information Retrieval CSCI

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

CS249: ADVANCED DATA MINING

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology

Evaluating Classifiers

Information Retrieval

The Principle and Improvement of the Algorithm of Matrix Factorization Model based on ALS

Computational Intelligence Meets the NetFlix Prize

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

CS145: INTRODUCTION TO DATA MINING

Retrieval Evaluation. Hongning Wang

Orange3 Data Fusion Documentation. Biolab

Transcription:

Recommender Systems 6CCS3WSN-7CCSMWAL

http://insidebigdata.com/wp-content/uploads/2014/06/humorrecommender.jpg

Some basic methods of recommendation
- Recommend popular items
- Collaborative Filtering, Item-to-Item: People who buy X also buy Y. Amazon (Items), Facebook (Friends), YouTube (Movies)
- Content Based Filtering: User Profile plus Description of Items. If a User watched a lot of Spy movies, recommend items classified as Spy movies. Used by Netflix (among other things)
- Whatever else works (graph clustering, ...)

Amazon.com: Item-to-Item Collaborative Filtering

User Personal Profile

Various types of Item recommendation

Netflix: Content based filtering

One Netflix personalization is the collection of genre rows (aimed at the user's tastes). These range from familiar high-level categories like Comedies and Dramas to highly tailored slices such as Imaginative Time Travel Movies. Each row has 3 layers of personalization (for the user): the choice of genre itself, the subset of titles selected within that genre, and the ranking of those titles. (Experimentally) we measured an increase in member retention by placing the most (user) tailored rows higher on the page instead of lower. (An example of A/B testing.)

The Netflix Prize

The Netflix Prize and the Recommendation Problem. In 2006 we announced the Netflix Prize, a machine learning and data mining competition for movie rating prediction. We offered $1 million to whoever improved the accuracy of our existing system called Cinematch by 10%. We conducted this competition to find new ways to improve the recommendations we provide to our members, which is a key part of our business. However, we had to come up with a proxy question that was easier to evaluate and quantify: the root mean squared error (RMSE) of the predicted rating. The race was on to beat our RMSE of 0.9525 with the finish line of reducing it to 0.8572 or less.

A year into the competition, the Korbell team won the first Progress Prize with an 8.43% improvement. They reported more than 2000 hours of work in order to come up with the final combination of 107 algorithms that gave them this prize. (...) To put these algorithms to use, we had to work to overcome some limitations, for instance that they were built to handle 100 million ratings, instead of the more than 5 billion that we have, and that they were not built to adapt as members added more ratings. But once we overcame those challenges, we put the two algorithms into production, where they are still used as part of our recommendation engine.

http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
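The proxy metric is easy to state in code. Below is a minimal RMSE sketch in R (the rating vectors are invented for illustration; they are not Netflix data):

# Root mean squared error between predicted and actual ratings
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

predicted <- c(3.8, 2.1, 4.5, 1.9)   # hypothetical predicted ratings
actual    <- c(4,   2,   5,   1)     # hypothetical true ratings
rmse(predicted, actual)              # smaller is better; 0 would be a perfect predictor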

Collaborative filtering
- Item based
- User based

Basic idea: Exploit User-Item relationships

Item based: Path melon-A-grapes. People who like item melon also like item grapes.
User based: Path C-melon-A. Users A and C are similar.

Application? Recommend User A's items to C (shopping). Recommend Users A and C to each other (online dating).

Example: Item based Collaborative Filtering

Cosine Similarity:

$S(a, b) = \cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$

where $a = (a_1, \ldots, a_n)$ is a vector, $a \cdot b = \sum_{i=1}^{n} a_i b_i$, and $\|a\|^2 = \sum_{i=1}^{n} a_i^2$.

Where did we see this before?
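As a minimal sketch, this definition translates directly into R (cosine_sim is an illustrative helper, not lecture code; the two vectors are the item vectors that appear in Step 2 below):

# Cosine similarity: dot product divided by the product of the norms
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

v1 <- c(5, 3)        # item m1, as rated by u2 and u3 (see Step 2)
v2 <- c(2, 3)        # item m2, as rated by u2 and u3
cosine_sim(v1, v2)   # approximately 0.904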

We will consider the following sample data of the preferences of four users for three items:

ID   user  item  rating
241  u1    m1    2
222  u1    m3    3
276  u2    m1    5
273  u2    m2    2
200  u3    m1    3
229  u3    m2    3
231  u3    m3    1
239  u4    m2    2
286  u4    m3    2

Step 1: Write the user-item ratings data in matrix form.

     m1  m2  m3
u1    2   ?   3
u2    5   2   ?
u3    3   3   1
u4    ?   2   2
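Step 1 can be done mechanically; a minimal R sketch (the object names ratings and M are just illustrative) that turns the triples above into the 4 x 3 matrix, with NA standing for the '?' entries:

# Build the user-item matrix from the (user, item, rating) triples
ratings <- data.frame(
  user   = c("u1","u1","u2","u2","u3","u3","u3","u4","u4"),
  item   = c("m1","m3","m1","m2","m1","m2","m3","m2","m3"),
  rating = c( 2,   3,   5,   2,   3,   3,   1,   2,   2 )
)

M <- matrix(NA, nrow = 4, ncol = 3,
            dimnames = list(c("u1","u2","u3","u4"), c("m1","m2","m3")))
M[cbind(ratings$user, ratings$item)] <- ratings$rating
M
#    m1 m2 m3
# u1  2 NA  3
# u2  5  2 NA
# u3  3  3  1
# u4 NA  2  2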

Calculate similarity

Step 2. Calculate the similarity between items, e.g. m1 and m2.

     m1  m2                 m1  m2
u1    2   ?
u2    5   2    ------>       5   2
u3    3   3                  3   3
u4    ?   2

Fortunately both m1 and m2 have been rated by users u2 and u3. We create two item-vectors, v1 for item m1 and v2 for item m2, and find the cosine similarity between them. At this point there are several approaches. We use the one where the similarity is based on all pairs of users who rated both items, ignoring their other ratings. Thus, the two item-vectors would be

$v_1 = 5u_2 + 3u_3$
$v_2 = 2u_2 + 3u_3$

The cosine similarity between the two vectors, v1 and v2, would then be:

$\cos(v_1, v_2) = \frac{5 \times 2 + 3 \times 3}{\sqrt{25 + 9}\,\sqrt{4 + 9}} = 0.904$
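A sketch of Step 2 as a function (item_sim is illustrative and assumes the matrix M built in the earlier sketch): keep only the users who rated both items, then apply the cosine formula to the two shortened columns.

# Cosine similarity between two item columns, using only co-rating users
item_sim <- function(M, i, j) {
  both <- !is.na(M[, i]) & !is.na(M[, j])   # users who rated both items
  v1 <- M[both, i]
  v2 <- M[both, j]
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}

item_sim(M, "m1", "m2")   # ~0.90
item_sim(M, "m1", "m3")   # ~0.79
item_sim(M, "m2", "m3")   # ~0.87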

Item-item similarity

The complete item-to-item similarity matrix is as follows:

     m1    m2    m3
m1    1   0.90  0.79
m2          1   0.87
m3                1

This table can be pre-computed.

Step 3. Use the table to estimate user ratings for missing items. u1 rated m1 and m3:

     m1  m2  m3
u1    2   ?   3

$R(u_1, m_2) = \frac{2\,S(m_1, m_2) + 3\,S(m_2, m_3)}{S(m_1, m_2) + S(m_2, m_3)} = \frac{2 \times 0.90 + 3 \times 0.87}{0.90 + 0.87} = 2.5$

$R(u_2, m_3) = \frac{5\,S(m_1, m_3) + 2\,S(m_2, m_3)}{S(m_1, m_3) + S(m_2, m_3)} = 3.4$

$R(u_4, m_1) = \frac{2\,S(m_1, m_2) + 2\,S(m_1, m_3)}{S(m_1, m_2) + S(m_1, m_3)} = 2$
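Step 3 as a sketch (predict_rating is illustrative and reuses M and item_sim from the sketches above): the missing rating is the similarity-weighted average of the user's known ratings.

# Predict a missing rating for (user u, item i) as the similarity-weighted
# average of u's known ratings, weighted by item-item similarity to i
predict_rating <- function(M, u, i) {
  rated <- names(which(!is.na(M[u, ])))                 # items u has rated
  sims  <- sapply(rated, function(j) item_sim(M, i, j))
  sum(sims * M[u, rated]) / sum(sims)
}

predict_rating(M, "u1", "m2")   # ~2.5
predict_rating(M, "u2", "m3")   # ~3.4
predict_rating(M, "u4", "m1")   # ~2.0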

Fill in missing values

Before:
     m1  m2  m3
u1    2   ?   3
u2    5   2   ?
u3    3   3   1
u4    ?   2   2

After:
     m1   m2   m3
u1    2  2.5    3
u2    5    2  3.4
u3    3    3    1
u4    2    2    2

Content based filtering

Content based filtering

J's favorite cake is Choco Cream. J went to a cake shop for it, but the Choco Cream cakes were sold out. J asked the shopkeeper to recommend something similar and was recommended Choco Fudge, a cake that has the same ingredients. J bought it.

Content-based (CB) filtering systems are systems recommending items similar to items a user liked in the past. These systems focus on algorithms which assemble users' preferences into user profiles and all item information into item profiles. Then they recommend those items close to the user by similarity of their profiles. A user profile is a set of assigned keywords (terms, features) collected from items previously found relevant (or interesting) by the user. An item profile is a set of assigned keywords (terms, features) of the item itself.

See http://recommender.no/info/content-based-filtering-recommender-systems/

J liked Choco Cream cakes; their ingredients (along with other things J likes) form J's user profile. The system reviewed the other available item profiles and found that the Choco Fudge cake had the most similar item profile. The similarity is high because both cakes have many of the same ingredients (chocolate, sugar, sponge cake). This was the reason for the recommendation.

Q: Where have we seen this sort of thing before?

Where have we seen this before?

Item-1 = (property 1, property 2, ..., property n)
Item-2 = (property 1, property 2, ..., property n)
Item-k = (property 1, property 2, ..., property n)
User-Tastes = (property 1, property 2, ..., property n)

An Item is a vector of properties. A User's Tastes is a vector of properties. Retrieve the Items most appropriate to the User's Tastes. The quality of the system depends on finding good descriptive properties.

Classic Information Retrieval

A document is a vector of terms. A user query is a vector of terms. Retrieve the documents most appropriate to the user query.

An Item is a vector of properties. A User's Tastes is a vector of properties. Retrieve the Items most appropriate to the User's Tastes.
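As a minimal sketch of this analogy in R (all property names and scores below are invented for illustration), items and the user's tastes live in the same property space, and items are ranked by cosine similarity to the taste vector:

# Content-based matching: rank items by cosine similarity of their
# property vectors to the user's taste vector (made-up data)
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

properties <- c("spy", "comedy", "romance")
items <- rbind(
  film_a = c(1, 0.0, 0),   # hypothetical pure spy film
  film_b = c(1, 1.0, 0),   # hypothetical spy comedy
  film_c = c(0, 0.0, 1)    # hypothetical romance
)
colnames(items) <- properties

user_tastes <- c(spy = 1, comedy = 0.5, romance = 0)   # built from past likes

scores <- apply(items, 1, cosine_sim, b = user_tastes)
sort(scores, decreasing = TRUE)   # recommend the highest-scoring items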

Summary

A common approach to designing recommender systems is content-based filtering. Content-based filtering methods are based on a description of the item and a profile of the user's preferences. In a content-based recommender system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes. In other words, the algorithm tries to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended. This approach has its roots in information retrieval and information filtering research.

To abstract the features of the items in the system, an item presentation algorithm is applied. A widely used algorithm is the tf-idf representation (also called the vector space representation). To create a user profile, the system mostly focuses on two types of information:
1. A model of the user's preferences.
2. A history of the user's interaction with the recommender system.
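A rough sketch of the tf-idf weighting mentioned above, in base R (the item "descriptions" are invented toy data, not from any real catalogue):

# Toy tf-idf weights for item descriptions
docs <- list(
  choco_cream = c("chocolate", "cream", "sugar", "sponge"),
  choco_fudge = c("chocolate", "fudge", "sugar", "sponge"),
  fruit_tart  = c("fruit", "sugar", "pastry")
)

terms <- sort(unique(unlist(docs)))
tf    <- t(sapply(docs, function(d) table(factor(d, levels = terms))))  # term counts per item
df    <- colSums(tf > 0)                 # number of items containing each term
idf   <- log(length(docs) / df)          # rare terms get higher weight
tfidf <- sweep(tf, 2, idf, "*")          # weight the counts by idf
round(tfidf, 2)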

Compare some methods on MovieLens data

MovieLens: https://movielens.org/ Non-commercial, personalized movie recommendations. Data: (user, movie, rating, ...)

Compare Popular, Random, UBCF, IBCF recommendations:
- Create an evaluation scheme for the data set.
- Take 90% of the data for training (to build the data matrix), predict the top n recommendations for each user based on the various methods, and then check the answers against the 10% of the data we kept back.
- Get n = 1, 3, 5, 10, 15, 20 recommendations for users.

Results. ROC and precision-recall

The meaning of the plots (see IR lectures)

True and false positives:
True positive rate = True positives / Relevant docs
False positive rate = False positives / Non-relevant docs

The simplest case: we know the true answer (which documents are Relevant), and we look at how the classifier worked.

Evaluating an IR system:
Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved
False negatives: relevant docs judged as non-relevant by the IR system

Consider the first row sum and first column sum:

               Relevant               Non-relevant
Retrieved      tp (true positives)    fp (false positives)
Not Retrieved  fn (false negatives)   tn (true negatives)

Precision  P = tp / (tp + fp)
Recall     R = tp / (tp + fn)
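These quantities are simple ratios of the four counts in the table; a tiny R sketch with made-up counts:

# Evaluation measures from confusion-matrix counts (toy numbers)
tp <- 30; fp <- 10; fn <- 20; tn <- 940

precision <- tp / (tp + fp)   # fraction of retrieved docs that are relevant
recall    <- tp / (tp + fn)   # fraction of relevant docs that are retrieved
tpr       <- tp / (tp + fn)   # true positive rate (identical to recall)
fpr       <- fp / (fp + tn)   # false positive rate

c(precision = precision, recall = recall, TPR = tpr, FPR = fpr)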

Various comments

It seems like UBCF did better than IBCF. Then why would we use IBCF? The answer lies in when and how we generate recommendations. UBCF saves the whole matrix of data and generates the recommendation at predict time by finding the closest users. IBCF saves only the k closest items in the matrix and doesn't have to generate everything: it is pre-calculated, and predict simply reads off the closest items.

Understandably, RANDOM is the worst. But perhaps surprisingly, it's hard to beat POPULAR. I guess we are not so different, you and I.

Quoted from https://www.r-bloggers.com/testing-recommender-systems-in-r/

R for the experiment

# https://www.r-bloggers.com/testing-recommender-systems-in-r/
# Load required library
library(recommenderlab)

data(MovieLense)
# 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.

# Let's check some algorithms against each other
scheme <- evaluationScheme(MovieLense, method = "split", train = 0.9,
                           k = 1, given = 10, goodRating = 4)
# See ?evaluationScheme for details; 90% of the data is used for training (fill in the matrix)

algorithms <- list(
  "random items"  = list(name = "RANDOM",  param = list(normalize = "Z-score")),
  "popular items" = list(name = "POPULAR", param = list(normalize = "Z-score")),
  "user-based CF" = list(name = "UBCF",    param = list(normalize = "Z-score",
                                                        method = "Cosine",
                                                        nn = 50, minRating = 3)),
  "item-based CF" = list(name = "IBCF",    param = list(normalize = "Z-score"))
)

# Run algorithms, predict next n movies
results <- evaluate(scheme, algorithms, n = c(1, 3, 5, 10, 15, 20))

# Draw ROC curve
plot(results, annotate = 1:4, legend = "topleft")

# See precision / recall
plot(results, "prec/rec", annotate = 3)
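To read off the numbers behind the curves rather than only plotting them, recommenderlab can report the averaged confusion-matrix counts for each method; a small follow-up (assuming the results object created above; the exact output format depends on the package version):

# Averaged TP/FP/FN/TN counts per value of n for one of the methods
avg(results[["user-based CF"]])

# Raw per-run counts for another method
getConfusionMatrix(results[["item-based CF"]])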

The notes used material from:

- The Netflix Prize: http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
- Amazon.com Recommendations: Item-to-Item Collaborative Filtering: https://www.cs.umd.edu/~samir/498/amazon-recommendations.pdf
- Chapter 9 of Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman: http://www.mmds.org/#book
- https://ashokharnal.wordpress.com/2014/12/18/worked-out-example-itembased-collaborative-filtering-for-recommenmder-engine/ (Example: Item based Collaborative Filtering; but the working is wrong on that page)
- http://recommender.no/ (all sorts of material, e.g. http://recommender.no/info/content-based-filtering-recommender-systems/)
- And not forgetting Wikipedia.
- https://www.r-bloggers.com/testing-recommender-systems-in-r/
- https://sanealytics.com/2012/06/10/testing-recommender-systems-in-r/