Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Similar documents
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Data Mining

BBS654 Data Mining. Pinar Duygulu

CS 124/LINGUIST 180 From Languages to Information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #16: Recommendation Systems

Recommendation and Advertising. Shannon Quinn (with thanks to J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University)

Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Recommender Systems: Collaborative Filtering and Matrix Factorization

Recommender Systems - Content, Collaborative, Hybrid

Recommender Systems 6CCS3WSN-7CCSMWAL

Recommender Systems. Collaborative Filtering & Content-Based Recommending

Big Data Analytics CSCI 4030

CS 345A Data Mining Lecture 1. Introduction to Web Mining

Real-time Recommendations on Spark. Jan Neumann, Sridhar Alla (Comcast Labs) DC Spark Interactive Meetup East May

COMP 465: Data Mining Recommender Systems

Recommendation Systems

Data Mining Lecture 2: Recommender Systems

Recommender Systems - Introduction. Data Mining Lecture 2: Recommender Systems

Knowledge Discovery and Data Mining 1 (VO) ( )

COMP6237 Data Mining Making Recommendations. Jonathon Hare

Agenda. Collaborative Filtering (CF)

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams.

Recommender System. What is it? How to build it? Challenges. R package: recommenderlab

Using Social Networks to Improve Movie Rating Predictions

Mining Web Data. Lijun Zhang

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Towards a hybrid approach to Netflix Challenge

Recommender Systems. Techniques of AI

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

Part 11: Collaborative Filtering. Francesco Ricci

Mining Social Network Graphs

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

CS249: ADVANCED DATA MINING

CSE 5243 INTRO. TO DATA MINING

Data Mining Techniques

Machine Learning using MapReduce

CS224W Project: Recommendation System Models in Product Rating Predictions

Jeff Howbert Introduction to Machine Learning Winter

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Predicting User Ratings Using Status Models on Amazon.com

CSE 258 Lecture 8. Web Mining and Recommender Systems. Extensions of latent-factor models, (and more on the Netflix prize)

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Recommendation Algorithms: Collaborative Filtering. CSE 6111 Presentation Advanced Algorithms Fall Presented by: Farzana Yasmeen

Performance Comparison of Algorithms for Movie Rating Estimation

Web Personalization & Recommender Systems

arxiv: v4 [cs.ir] 28 Jul 2016

Collaborative Filtering using Weighted BiPartite Graph Projection A Recommendation System for Yelp

CSE 158 Lecture 8. Web Mining and Recommender Systems. Extensions of latent-factor models, (and more on the Netflix prize)

Distributed Itembased Collaborative Filtering with Apache Mahout. Sebastian Schelter twitter.com/sscdotopen. 7.

Problem 1: Complexity of Update Rules for Logistic Regression

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1

CSE 158. Web Mining and Recommender Systems. Midterm recap

The influence of social filtering in recommender systems

General Instructions. Questions

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

International Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #19: Machine Learning 1

Feature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule.

Machine Learning and Data Mining. Collaborative Filtering & Recommender Systems. Kalev Kask

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron

A probabilistic model to resolve diversity-accuracy challenge of recommendation systems

CS425: Algorithms for Web Scale Data

Text Analytics (Text Mining)

CPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017

Classification: Feature Vectors

Collaborative Filtering using Euclidean Distance in Recommendation Engine

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

Analysis of Large Graphs: TrustRank and WebSpam

Slides based on those in:

SOCIAL MEDIA MINING. Data Mining Essentials

Nearest Neighbor with KD Trees

Transcription:

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Stanford University. http://www.mmds.org

High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
Graph data: PageRank, SimRank; Community Detection; Spam Detection
Infinite data: Filtering data streams; Web advertising; Queries on streams
Machine learning: SVM; Decision Trees; Perceptron, kNN
Apps: Recommender systems; Association Rules; Duplicate document detection

Customer X: buys Metallica CD, buys Megadeth CD. Customer Y: does a search on Metallica. The recommender system suggests Megadeth, from data collected about customer X.

Examples: search, recommendations. Items: products, web sites, blogs, news items, ...

Shelf space is a scarce commodity for traditional retailers (also: TV networks, movie theaters, ...). The Web enables near-zero-cost dissemination of information about products: from scarcity to abundance. More choice necessitates better filters: recommendation engines. How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/.0/tail.html

Source: Chris Anderson (2004)

Read http://www.wired.com/wired/archive/.0/tail.html to learn more!

Editorial and hand curated: lists of favorites, lists of essential items. Simple aggregates: Top 10, Most Popular, Recent Uploads. Tailored to individual users: Amazon, Netflix, ...

X = set of Customers, S = set of Items. Utility function u: X × S → R, where R = set of ratings. R is a totally ordered set, e.g., 0-5 stars, or a real number in [0,1].

[Example utility matrix: rows = users Alice, Bob, Carol, David; columns = movies Avatar, LOTR, Matrix, Pirates; a few known ratings in [0,1], most entries unknown]

(1) Gathering known ratings for the matrix: how to collect the data in the utility matrix.
(2) Extrapolating unknown ratings from the known ones: mainly interested in high unknown ratings; we are not interested in knowing what you don't like but what you like.
(3) Evaluating extrapolation methods: how to measure the success/performance of recommendation methods.

Explicit: ask people to rate items. Doesn't work well in practice: people can't be bothered. Implicit: learn ratings from user actions, e.g., a purchase implies a high rating. What about low ratings?

Key problem: the utility matrix U is sparse; most people have not rated most items. Cold start: new items have no ratings, new users have no history. Three approaches to recommender systems: 1) Content-based, 2) Collaborative, 3) Latent factor based. Today: content-based and collaborative filtering.

Main idea: recommend to customer x items similar to previous items rated highly by x. Example: movie recommendations: recommend movies with the same actor(s), director, genre, ... Websites, blogs, news: recommend other sites with similar content.

[Diagram: from items the user likes (red circles, triangles), build item profiles; match against the user profile to recommend new items]

For each item, create an item profile. A profile is a set (vector) of features. Movies: author, title, actor, director, ... Text: the set of important words in the document. How to pick important features? The usual heuristic from text mining is TF-IDF (Term Frequency * Inverse Doc Frequency). Term = feature; document = item.

f_ij = frequency of term (feature) i in doc (item) j
TF_ij = f_ij / max_k f_kj (note: we normalize TF to discount for longer documents)
n_i = number of docs that mention term i; N = total number of docs
IDF_i = log(N / n_i)
TF-IDF score: w_ij = TF_ij × IDF_i
Doc profile = set of words with highest TF-IDF scores, together with their scores.
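As a sketch, the TF-IDF profile construction above can be written in a few lines of Python. The documents and tokens here are made up for illustration; TF is normalized by the most frequent term in each document, and IDF uses the natural logarithm:

```python
import math
from collections import Counter

def tf_idf_profiles(docs):
    """docs: list of token lists. Returns one {term: weight} profile per doc."""
    N = len(docs)
    # n_i = number of docs that mention term i
    n = Counter(term for doc in docs for term in set(doc))
    profiles = []
    for doc in docs:
        f = Counter(doc)                  # f_ij: raw term frequencies
        max_f = max(f.values())           # normalize by the most frequent term
        profiles.append({t: (cnt / max_f) * math.log(N / n[t])
                         for t, cnt in f.items()})
    return profiles

docs = [["matrix", "sci-fi", "action", "action"],
        ["pirates", "adventure", "action"],
        ["matrix", "sequel", "sci-fi"]]
profiles = tf_idf_profiles(docs)
# "action" appears in 2 of 3 docs; "pirates" in only 1, so it gets a higher IDF
```

The doc profile would then keep only the highest-weighted terms from each dictionary.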

User profile possibilities: weighted average of the rated item profiles; variation: weight by the difference from the average rating for the item. Prediction heuristic: given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||).
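A minimal sketch of this prediction heuristic in Python; the two 3-feature profiles below (e.g., genre weights) are hypothetical:

```python
import math

def cosine(x, i):
    """u(x, i) = cos(x, i) = x·i / (|x| |i|) for dense feature vectors."""
    dot = sum(a * b for a, b in zip(x, i))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in i))
    return dot / norm if norm else 0.0

# Hypothetical 3-feature profiles (e.g., genre weights)
user_profile = [0.8, 0.1, 0.5]
item_profile = [1.0, 0.0, 0.6]
score = cosine(user_profile, item_profile)  # closer to 1 = better match
```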

+: No need for data on other users: no cold-start or sparsity problems.
+: Able to recommend to users with unique tastes.
+: Able to recommend new & unpopular items: no first-rater problem.
+: Able to provide explanations: can explain recommendations by listing the content features that caused an item to be recommended.

-: Finding the appropriate features is hard, e.g., for images, movies, music.
-: Recommendations for new users: how to build a user profile?
-: Overspecialization: never recommends items outside the user's content profile; people might have multiple interests.
-: Unable to exploit quality judgments of other users.

Harnessing quality judgments of other users

Consider user x. Find a set N of other users whose ratings are similar to x's ratings. Estimate x's ratings based on the ratings of the users in N.

Let r_x be the vector of user x's ratings.
Jaccard similarity measure: treat r_x, r_y as sets of rated items. Problem: ignores the value of the rating.
Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||). Problem: treats missing ratings as negative.
Pearson correlation coefficient: let S_xy = items rated by both users x and y, and let r̄_x, r̄_y be the average ratings of x and y. Then
sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( sqrt(Σ_{s∈S_xy} (r_xs − r̄_x)²) · sqrt(Σ_{s∈S_xy} (r_ys − r̄_y)²) )
Example: r_x = [*, _, _, *, ***], r_y = [*, _, **, **, _].
As sets (indices of rated items): r_x = {1, 4, 5}, r_y = {1, 3, 4}.
As points (star counts, missing = 0): r_x = {1, 0, 0, 1, 3}, r_y = {1, 0, 2, 2, 0}.
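Directly from the definitions above, the three measures can be sketched in Python on the slide's example vectors (0 marks a missing rating):

```python
import math

r_x = [1, 0, 0, 1, 3]   # star counts from the example; 0 = not rated
r_y = [1, 0, 2, 2, 0]

def jaccard(x, y):
    # Treats ratings as sets of rated items, ignoring the rating values
    rated_x = {i for i, r in enumerate(x) if r}
    rated_y = {i for i, r in enumerate(y) if r}
    return len(rated_x & rated_y) / len(rated_x | rated_y)

def cosine(x, y):
    # Treats missing ratings as zeros, i.e., as negative evidence
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def pearson(x, y):
    # Compares only co-rated items, centered by each user's mean rating
    common = [i for i in range(len(x)) if x[i] and y[i]]   # S_xy
    mx = sum(v for v in x if v) / len([v for v in x if v])
    my = sum(v for v in y if v) / len([v for v in y if v])
    num = sum((x[i] - mx) * (y[i] - my) for i in common)
    den = (math.sqrt(sum((x[i] - mx) ** 2 for i in common)) *
           math.sqrt(sum((y[i] - my) ** 2 for i in common)))
    return num / den
```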

Cosine similarity: sim(x, y) = Σ_i r_xi·r_yi / (||r_x|| · ||r_y||). Intuitively we want sim(A, B) > sim(A, C).
Jaccard similarity: 1/5 < 2/4.
Cosine similarity: 0.386 > 0.322, but it considers missing ratings as negative.
Solution: subtract the (row) mean. Then sim(A, B) vs. sim(A, C): 0.092 > -0.559.
Notice that cosine similarity is correlation when the data is centered at 0.

From similarity metric to recommendations: let r_x be the vector of user x's ratings, and let N be the set of the k users most similar to x who have rated item i.
Prediction for item i of user x: r_xi = (1/k) Σ_{y∈N} r_yi.
Better, with the shorthand s_xy = sim(x, y): r_xi = Σ_{y∈N} s_xy·r_yi / Σ_{y∈N} s_xy.
Many other tricks are possible.
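A sketch of the weighted-average prediction r_xi = Σ s_xy·r_yi / Σ s_xy; the neighborhood of three users below is hypothetical:

```python
def predict_user_user(sims, ratings):
    """r_xi = sum_y s_xy * r_yi / sum_y s_xy over the neighborhood N:
    the k most similar users to x who have rated item i.
    sims: {user: similarity to x}; ratings: {user: that user's rating of item i}."""
    num = sum(sims[y] * r for y, r in ratings.items())
    den = sum(sims[y] for y in ratings)
    return num / den

# Hypothetical neighborhood of k=3 similar users who rated the item
sims = {"u1": 0.9, "u2": 0.5, "u3": 0.2}
ratings = {"u1": 5, "u2": 3, "u3": 1}
pred = predict_user_user(sims, ratings)  # pulled toward u1, the most similar user
```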

So far: user-user collaborative filtering. Another view: item-item. For item i, find other similar items; estimate the rating for item i based on x's ratings for those similar items. We can use the same similarity metrics and prediction functions as in the user-user model:
r_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij
where s_ij = similarity of items i and j, r_xj = rating of user x on item j, and N(i;x) = set of items rated by x that are similar to i.

[Utility matrix: 12 users (columns) × 6 movies (rows); blank = unknown rating; known ratings are between 1 and 5]

[Same utility matrix; goal: estimate the rating of movie 1 by user 5]

Neighbor selection: identify movies similar to movie 1 that were rated by user 5. Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i: m_1 = (1+3+5+5+4)/5 = 3.6, so row 1 becomes [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows: sim(1, m) for m = 1..6 is 1.00, -0.18, 0.41, -0.10, -0.31, 0.59

Compute similarity weights: s_1,3 = 0.41, s_1,6 = 0.59.

Predict by taking the weighted average: r_1,5 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6. In general: r_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij.
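As a quick check of the slide's arithmetic (the similarities and user 5's ratings of movies 3 and 6 are taken from the example above):

```python
s = {3: 0.41, 6: 0.59}   # s_1,3 and s_1,6 from the neighbor selection step
r = {3: 2, 6: 3}         # user 5's ratings of movies 3 and 6
pred = sum(s[j] * r[j] for j in s) / sum(s.values())
print(round(pred, 1))    # 2.6
```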

Define the similarity s_ij of items i and j. Select the k nearest neighbors N(i; x): as before, the items most similar to i that were rated by x. Estimate the rating r_xi as the weighted average:
r_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij
or, better, relative to a baseline estimate b_xi for r_xi:
r_xi = b_xi + Σ_{j∈N(i;x)} s_ij·(r_xj − b_xj) / Σ_{j∈N(i;x)} s_ij
where b_xi = μ + b_x + b_i, μ = overall mean movie rating, b_x = rating deviation of user x = (avg. rating of user x) − μ, and b_i = rating deviation of movie i.
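A sketch of the baseline-adjusted item-item prediction above; the global mean, deviations, similarities, and ratings below are all hypothetical:

```python
def baseline(mu, b_user, b_item, x, i):
    """b_xi = mu + b_x + b_i (global mean plus user and item deviations)."""
    return mu + b_user.get(x, 0.0) + b_item.get(i, 0.0)

def predict_item_item(x, i, neighbors, ratings, mu, b_user, b_item):
    """neighbors: {j: s_ij} for items j similar to i and rated by x.
    ratings: {j: r_xj}. Returns b_xi plus the weighted average of
    baseline-adjusted neighbor ratings."""
    b_xi = baseline(mu, b_user, b_item, x, i)
    num = sum(s * (ratings[j] - baseline(mu, b_user, b_item, x, j))
              for j, s in neighbors.items())
    den = sum(neighbors.values())
    return b_xi + num / den

# Hypothetical example: global mean 3.0, user "x" rates 0.5 above average,
# item "i" is rated 0.2 below average.
mu, b_user, b_item = 3.0, {"x": 0.5}, {"i": -0.2, "j1": 0.1, "j2": 0.0}
pred = predict_item_item("x", "i", {"j1": 0.8, "j2": 0.4},
                         {"j1": 4, "j2": 3}, mu, b_user, b_item)
```

Subtracting the baselines before averaging keeps a harsh or generous neighbor item from biasing the estimate.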

[Example utility matrix: users Alice, Bob, Carol, David; movies Avatar, LOTR, Matrix, Pirates] In practice, it has been observed that item-item often works better than user-user. Why? Items are simpler; users have multiple tastes.

+ Works for any kind of item: no feature selection needed.
- Cold start: need enough users in the system to find a match.
- Sparsity: the user/ratings matrix is sparse; it is hard to find users that have rated the same items.
- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items).
- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items.

Implement two or more different recommenders and combine their predictions, perhaps using a linear model. Add content-based methods to collaborative filtering: item profiles for the new-item problem, demographics to deal with the new-user problem.

- Evaluation
- Error metrics
- Complexity / Speed

[Figure: users × movies utility matrix, with some entries withheld as a test data set]

Compare predictions with known ratings.
Root-mean-square error (RMSE): sqrt( Σ_xi (r*_xi − r_xi)² / N ), where r*_xi is the predicted rating, r_xi is the true rating of x on i, and N is the number of ratings compared.
Precision at top 10: % of those in the top 10.
Rank correlation: Spearman's correlation between the system's and the user's complete rankings.
Another approach: 0/1 model. Coverage: number of items/users for which the system can make predictions. Precision: accuracy of predictions. Receiver operating characteristic (ROC): tradeoff curve between false positives and false negatives.
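RMSE is straightforward to compute from pairs of predicted and true ratings (the ratings below are hypothetical):

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error over pairs of (predicted, true) ratings."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Hypothetical held-out test ratings vs. a system's predictions
actual    = [5, 3, 4, 1]
predicted = [4.5, 3.0, 5.0, 2.0]
err = rmse(predicted, actual)
```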

A narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions. In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others.

The expensive step is finding the k most similar customers: O(|X|). Too expensive to do at runtime; could pre-compute, but naïve pre-computation takes time O(k·|X|), where X is the set of customers. We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction.

Leverage all the data. Don't try to reduce data size in an effort to make fancy algorithms work: simple methods on large data do best. Add more data, e.g., add IMDB data on genres. More data beats better algorithms: http://anand.typepad.com/datawocky/008/0/more-data-usual.html