Collaborative Filtering: Nearest Neighbor Approach. Jeff Howbert, Introduction to Machine Learning, Winter 2012

Similar documents
Machine Learning Feature Creation and Selection

Recommendation Algorithms: Collaborative Filtering. CSE 6111 Presentation Advanced Algorithms Fall Presented by: Farzana Yasmeen

Introduction. Chapter Background Recommender systems Collaborative based filtering

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Anonymization Algorithms - Microaggregation and Clustering

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

CISC 4631 Data Mining

Distributed Itembased Collaborative Filtering with Apache Mahout. Sebastian Schelter twitter.com/sscdotopen

Statistical Pattern Recognition

Announcements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.

Data Mining. Lecture 03: Nearest Neighbor Learning

SOCIAL MEDIA MINING. Data Mining Essentials

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

VIDAEXPERT: DATA ANALYSIS Here is the Statistics button.

Wild Mushrooms Classification Edible or Poisonous

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron

Statistical Pattern Recognition

Mining Web Data. Lijun Zhang

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD

Nearest Neighbor Classification. Machine Learning Fall 2017

CS7267 MACHINE LEARNING NEAREST NEIGHBOR ALGORITHM. Mingon Kang, PhD Computer Science, Kennesaw State University

Using Social Networks to Improve Movie Rating Predictions

Recommendation Systems

Gene Clustering & Classification

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

Variable Selection 6.783, Biomedical Decision Support

Mining Web Data. Lijun Zhang

Based on Raymond J. Mooney s slides

COSC6376 Cloud Computing Homework 1 Tutorial

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Topic 1 Classification Alternatives

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Machine Learning in Biology

Collaborative Filtering and Recommender Systems. Definitions. .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

k-nearest Neighbor (knn) Sept Youn-Hee Han

Basic Data Mining Technique

Non-Bayesian Classifiers Part I: k-nearest Neighbor Classifier and Distance Functions

Recap: Project and Practicum CS276B. Recommendation Systems. Plan for Today. Sample Applications. What do RSs achieve? Given a set of users and items

Instance-based Learning

Density estimation. In density estimation problems, we are given a random sample from an unknown density. Our objective is to estimate

CS570: Introduction to Data Mining

Proposing a New Metric for Collaborative Filtering

Introduction to Data Mining

Part 11: Collaborative Filtering. Francesco Ricci

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Voronoi Region. K-means method for Signal Compression: Vector Quantization. Compression Formula 11/20/2013

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1

Statistical Pattern Recognition

Supervised vs unsupervised clustering

Classification and Regression

Part 11: Collaborative Filtering. Francesco Ricci

COMPUTATIONAL INTELLIGENCE SEW (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 6: k-nn Cross-validation Regularization

A Time-based Recommender System using Implicit Feedback

A Dendrogram. Bioinformatics (Lec 17)

Supervised Learning: K-Nearest Neighbors and Decision Trees

Clustering Basic Concepts and Algorithms 1

CSE 573: Artificial Intelligence Autumn 2010

CS573 Data Privacy and Security. Li Xiong

Model Assessment and Selection. Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer

By Atul S. Kulkarni Graduate Student, University of Minnesota Duluth. Under The Guidance of Dr. Richard Maclin

Machine Learning: k-nearest Neighbors. Lecture 08. Razvan C. Bunescu School of Electrical Engineering and Computer Science

BBS654 Data Mining. Pinar Duygulu

Additive Regression Applied to a Large-Scale Collaborative Filtering Problem

CS 584 Data Mining. Classification 1

CS 124/LINGUIST 180 From Languages to Information

Community-Based Recommendations: a Solution to the Cold Start Problem

Knowledge Discovery and Data Mining 1 (VO) ( )

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Homework Assignment #3

Mathematics of Data. INFO-4604, Applied Machine Learning University of Colorado Boulder. September 5, 2017 Prof. Michael Paul

Recommendation Systems

CSC 411: Lecture 05: Nearest Neighbors

A probabilistic model to resolve diversity-accuracy challenge of recommendation systems

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

Data Mining Lecture 2: Recommender Systems

Predicting Popular Xbox games based on Search Queries of Users

ECS289: Scalable Machine Learning

SD 372 Pattern Recognition

Singular Value Decomposition, and Application to Recommender Systems

Sampling PCA, enhancing recovered missing values in large scale matrices. Luis Gabriel De Alba Rivera 80555S

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Parametrizing the easiness of machine learning problems. Sanjoy Dasgupta, UC San Diego

NMLRG #4 meeting in Berlin. Mobile network state characterization and prediction. P.Demestichas (1), S. Vassaki (2,3), A.Georgakopoulos (2,3)

Density estimation. In density estimation problems, we are given a random sample from an unknown density. Our objective is to estimate

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Collaborative Metric Learning

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Recommender Systems - Introduction. Data Mining Lecture 2: Recommender Systems

Classification: Feature Vectors

BordaRank: A Ranking Aggregation Based Approach to Collaborative Filtering

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets

Collective Intelligence in Action

Distribution-free Predictive Approaches

Collaborative Filtering: A Comparison of Graph-Based Semi-Supervised Learning Methods and Memory-Based Methods

Stat 4510/7510 Homework 4

Transcription:

Collaborative Filtering: Nearest Neighbor Approach. Jeff Howbert, Introduction to Machine Learning, Winter 2012

Bad news: the Netflix Prize data is no longer available to the public.
Just after the contest ended in July 2009:
- Plans for a Netflix Prize 2 contest were announced.
- The contest data was made available for further public research at the UC Irvine repository.
But a few months later:
- Netflix was being sued for alleged privacy breaches connected with the contest data.
- The FTC was investigating privacy concerns.
By March 2010, Netflix had settled the lawsuit privately, withdrawn the contest data from public use, and cancelled Netflix Prize 2.

Good news: an older rating dataset from GroupLens is still available, and is perfectly suitable for the CSS 490 / 590 project. It consists of data collected through the MovieLens rating website, and comes in three sizes:
- MovieLens 100k
- MovieLens 1M
- MovieLens 10M
http://www.grouplens.org/node/12
http://movielens.umn.edu/login

MovieLens 100k dataset properties:
- 943 users
- 1,682 movies
- 100,000 ratings on a 1-5 rating scale
- Rating matrix is 6.3% occupied
- Ratings per user: min = 20, mean = 106, max = 737
- Ratings per movie: min = 1, mean = 59, max = 583
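As a concrete starting point, here is a minimal sketch of loading the 100k ratings into a sparse per-user dictionary. It assumes the standard ml-100k "u.data" layout of one tab-separated "user_id, movie_id, rating, timestamp" record per line; the file path and helper name are illustrative.

```python
# Load MovieLens 100k ratings into a sparse user -> {movie: rating} map.
from collections import defaultdict

def load_ratings(path="u.data"):
    ratings = defaultdict(dict)              # ratings[user][movie] = rating
    with open(path) as f:
        for line in f:
            user, movie, rating, _timestamp = line.split("\t")
            ratings[int(user)][int(movie)] = int(rating)
    return ratings

ratings = load_ratings()
print(len(ratings))                          # expect 943 users
```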

Recommender system definition
DOMAIN: some field of activity where users buy, view, consume, or otherwise experience items.
PROCESS:
1. Users provide ratings on items they have experienced.
2. Take all <user, item, rating> data and build a predictive model.
3. For a user who hasn't experienced a particular item, use the model to predict how well they will like it (i.e., predict their rating).

Types of recommender systems
Predictions can be based on either:
- a content-based approach: explicit characteristics of users and items
- a collaborative filtering approach: implicit characteristics, based on the similarity of a user's preferences to those of other users

Collaborative filtering algorithms
Common types:
- Global effects
- Nearest neighbor
- Matrix factorization
- Restricted Boltzmann machine
- Clustering
- Etc.

Nearest neighbor in action
[Figure: sparse user x movie rating matrix, with users 1 through 480,189 as rows and movies 1 through 17,770 as columns; most cells are empty, and user 2's rating on movie 10 is marked "?". Users whose preferences are identical to user 2's get a strong weight; users with similar preferences get a moderate weight.]

Nearest neighbor classifiers
[Figure: an unknown record plotted among labeled training samples.]
Requires three inputs:
1. The set of stored samples.
2. A distance metric to compute the distance between samples.
3. The value of k, the number of nearest neighbors to retrieve.

Nearest neighbor classifiers
To classify the unknown record:
1. Compute its distance to the other training records.
2. Identify the k nearest neighbors.
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).

Nearest neighbor classification
Compute the distance between two points, for example the Euclidean distance

    d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}

Options for determining the class from the nearest neighbor list:
- Take a majority vote of the class labels among the k nearest neighbors.
- Weight the votes according to distance, e.g. with weight factor w = 1 / d^2.
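A minimal sketch of such a classifier, with Euclidean distance and the optional 1/d^2 vote weighting described above; all names here are illustrative rather than taken from any course code.

```python
# k-nearest-neighbor classification: majority vote, optionally weighted
# by inverse squared distance.
import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(train, query, k=3, weighted=False):
    # train: list of (feature_vector, class_label) pairs
    neighbors = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = euclidean(features, query)
        w = 1.0 / (d * d + 1e-9) if weighted else 1.0   # guard against d = 0
        votes[label] += w
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B")]
print(knn_classify(train, (1.1, 1.0), k=2, weighted=True))  # -> "A"
```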

Nearest neighbor in collaborative filtering
For our implementation in Project 2:
- It is actually a regression, not a classification: the prediction is a weighted combination of the neighbors' ratings (a real number).
- We consider all neighbors, not just the k-nearest subset of neighbors.
- Since we're not ranking neighbors by distance, distance is no longer relevant. Instead of distance, we calculate similarities that determine the weighting of each neighbor's rating.
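A minimal sketch of that prediction step. Normalizing by the sum of absolute similarities is one common convention and an assumption here, not necessarily the exact Project 2 specification:

```python
# Predict a user's rating on a movie as a similarity-weighted
# combination over ALL users who rated that movie (no k cutoff).
def predict(ratings, similarity, user, movie):
    # ratings[u][m] = rating; similarity(a, b) -> weight, e.g. PCC
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue                        # only users who rated the movie
        w = similarity(user, other)
        num += w * their[movie]
        den += abs(w)
    return num / den if den else None       # None: nobody rated this movie
```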

Nearest neighbor in action
For this example:
1. Find every user that has rated movie 10.
2. Compute the similarity between user 2 and each of those users.
3. Weight those users' ratings according to their similarities.
4. The predicted rating for user 2 is the sum of the other users' weighted ratings on movie 10.
[Figure: the same sparse rating matrix as before, with user 2's rating on movie 10 unknown. Identical preferences get a strong weight; similar preferences get a moderate weight.]

Measuring similarity of users
For Project 2 we will use Pearson's correlation coefficient (PCC) as the measure of similarity between users. Pearson's correlation coefficient is the covariance normalized by the standard deviations of the two variables:

    corr(x, y) = \frac{cov(x, y)}{\sigma_x \sigma_y}

It always lies in the range -1 to 1.
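A quick numerical check that Pearson's r is just the covariance divided by the product of standard deviations; the rating vectors here are made up for illustration.

```python
import numpy as np

x = np.array([2.0, 3.0, 3.0, 4.0])    # user a's ratings on shared movies
y = np.array([1.0, 3.0, 4.0, 4.0])    # user b's ratings on the same movies

manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())
print(manual, np.corrcoef(x, y)[0, 1])   # both ~0.87, always in [-1, 1]
```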

Measuring similarity of users
PCC similarity for two users a and b:

    PCC(a, b) = \frac{\sum_{j=1}^{n} (r_{a,j} - \bar{r}_a)(r_{b,j} - \bar{r}_b)}{\sqrt{\sum_{j=1}^{n} (r_{a,j} - \bar{r}_a)^2 \, \sum_{j=1}^{n} (r_{b,j} - \bar{r}_b)^2}}

Both sums are taken over only those movies rated by both a and b (indexed by j), where:
- r_{a,j} = rating by user a on movie j
- \bar{r}_a = average rating over all movies rated by user a
- n = number of movies rated by both a and b

Measuring similarity of users
Calculating PCC on a sparse matrix (see the sketch below):
- Calculate each user's average rating using only those cells where a rating exists.
- Subtract the user's average rating only from those cells where a rating exists.
- Calculate and sum the user-user cross-products and each user's deviations from average only over those movies where a rating exists for both users.
[Figure: the same sparse rating matrix as before.]
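A sketch of PCC on the sparse matrix, following the slide: user means use only observed ratings, and the sums run only over movies rated by both users. Function and variable names are illustrative.

```python
import math

def pcc(ratings, a, b):
    shared = ratings[a].keys() & ratings[b].keys()    # co-rated movies
    if not shared:
        return 0.0                                    # no overlap, no signal
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    mean_b = sum(ratings[b].values()) / len(ratings[b])
    num = sum((ratings[a][j] - mean_a) * (ratings[b][j] - mean_b)
              for j in shared)
    den = math.sqrt(sum((ratings[a][j] - mean_a) ** 2 for j in shared) *
                    sum((ratings[b][j] - mean_b) ** 2 for j in shared))
    return num / den if den else 0.0
```

With the earlier sketches, similarity = lambda u, v: pcc(ratings, u, v) plugs straight into predict().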