CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Similar documents
CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

AMAZON.COM RECOMMENDATIONS ITEM-TO-ITEM COLLABORATIVE FILTERING PAPER BY GREG LINDEN, BRENT SMITH, AND JEREMY YORK

CS4491/CS 7265 BIG DATA ANALYTICS

How do we obtain reliable estimates of performance measures?

ECLT 5810 Evaluation of Classification Quality

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

CS145: INTRODUCTION TO DATA MINING

CS249: ADVANCED DATA MINING

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Machine Learning and Bioinformatics 機器學習與生物資訊學

List of Exercises: Data Mining 1 December 12th, 2015

Today s topics. FAQs. Modify the way data is loaded on disk. Methods of the InputFormat abstract. Input and Output Patterns --Continued

Lecture 6 K- Nearest Neighbors(KNN) And Predictive Accuracy

Classification Part 4

K- Nearest Neighbors(KNN) And Predictive Accuracy

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Chapter 3: Supervised Learning

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

Property1 Property2. by Elvir Sabic. Recommender Systems Seminar Prof. Dr. Ulf Brefeld TU Darmstadt, WS 2013/14

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining)

Evaluating Classifiers

Part 11: Collaborative Filtering. Francesco Ricci

Evaluating Classifiers

Comparative performance of opensource recommender systems

Recommender Systems (RSs)

CS 584 Data Mining. Classification 3

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Section A. 1. a) Explain the evolution of information systems into today s complex information ecosystems and its consequences.

Evaluating Classifiers

Evaluating Machine Learning Methods: Part 1

COMPUTATIONAL INTELLIGENCE SEW (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 6: k-nn Cross-validation Regularization

Machine Learning Techniques for Data Mining

Evaluating Machine-Learning Methods. Goals for the lecture

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

CSE 158. Web Mining and Recommender Systems. Midterm recap

Knowledge Discovery and Data Mining 1 (VO) ( )

Singular Value Decomposition, and Application to Recommender Systems

Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Machine Learning. Cross Validation

Recommender Systems 6CCS3WSN-7CCSMWAL

Recommender Systems. Nivio Ziviani. Junho de Departamento de Ciência da Computação da UFMG

CS 124/LINGUIST 180 From Languages to Information

CS 124/LINGUIST 180 From Languages to Information

Model s Performance Measures

Diversity in Recommender Systems Week 2: The Problems. Toni Mikkola, Andy Valjakka, Heng Gui, Wilson Poon

I211: Information infrastructure II

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Classification. Instructor: Wei Ding

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Chapter 2 Basic Structure of High-Dimensional Spaces

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Data Mining Classification: Alternative Techniques. Imbalanced Class Problem

BBS654 Data Mining. Pinar Duygulu

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

A PROPOSED HYBRID BOOK RECOMMENDER SYSTEM

2. On classification and related tasks

INF 4300 Classification III Anne Solberg The agenda today:

Mining di Dati Web. Lezione 3 - Clustering and Classification

MapReduce Design Patterns

CS229 Lecture notes. Raphael John Lamarre Townshend

CS 124/LINGUIST 180 From Languages to Information

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

Hybrid Recommendation System Using Clustering and Collaborative Filtering

Data Mining Classification: Bayesian Decision Theory

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

CCRMA MIR Workshop 2014 Evaluating Information Retrieval Systems. Leigh M. Smith Humtap Inc.

Nearest neighbor classification DSE 220

COMP 465: Data Mining Classification Basics

Introduction to Data Mining

CS 229 Midterm Review

INTRODUCTION TO MACHINE LEARNING. Measuring model performance or error

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

Fast or furious? - User analysis of SF Express Inc

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

Artificial Neural Networks (Feedforward Nets)

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Graph Structure Over Time

Building a Recommendation System for EverQuest Landmark s Marketplace

Mining Web Data. Lijun Zhang

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Data Mining and Knowledge Discovery: Practice Notes

Machine learning in fmri

Predictive Indexing for Fast Search

Weka ( )

Collaborative Filtering based on User Trends

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 5 of Data Mining by I. H. Witten, E. Frank and M. A.

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

Based on Raymond J. Mooney s slides

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection

CSE 258. Web Mining and Recommender Systems. Advanced Recommender Systems

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Data Preprocessing. Slides by: Shree Jaswal

Recommender Systems - Content, Collaborative, Hybrid

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Data Mining and Knowledge Discovery Practice notes 2

Collaborative Filtering using Euclidean Distance in Recommendation Engine

Transcription:

W10.B.0 CS435 Introduction to Big Data
PART 1. LARGE SCALE DATA ANALYTICS
4. RECOMMENDATION SYSTEMS
5. EVALUATION AND VALIDATION TECHNIQUES
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435

W10.B.1 FAQs
Term project: 5:00 PM, March 29, 2018
PA2 Recitation: Friday
No office hour on Friday (10-11 AM)

W10.B.2 Today's topics
Recommendation Systems: Collaborative Filtering, Amazon's Item-to-Item CF
Evaluation/Validation Techniques

W10.B.3 4. Recommendation Systems: Collaborative filtering

W10.B.4 W10.B.5 Collaborative filtering
Focus on the similarity of the user ratings for items
Users are similar if their vectors are close according to some distance measure, e.g., Jaccard or cosine distance
Collaborative filtering is the process of identifying similar users and recommending what similar users like
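As a concrete illustration of that idea (not code from the lecture), here is a minimal user-based sketch in Python; the ratings dictionary and the helper names are made up for the example:

from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}

def cosine(u, v):
    # Treat missing ratings as 0: only co-rated items contribute to the dot product.
    dot = sum(u[i] * v[i] for i in u if i in v)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(target, ratings, k=2):
    # Rank the other users by similarity to the target user...
    neighbors = sorted((u for u in ratings if u != target),
                       key=lambda u: cosine(ratings[target], ratings[u]),
                       reverse=True)[:k]
    # ...and suggest items the neighbors rated that the target has not seen.
    seen = set(ratings[target])
    return {i for u in neighbors for i in ratings[u] if i not in seen}

print(recommend("A", ratings))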

W10.B.6 Measuring similarity
How do we measure the similarity of users or items from their rows or columns in the utility matrix?

     HP1  HP2  HP3  TW   SW1  SW2  SW3
A     4              5    1
B     5    5    4
C                    2    4    5
D          3                        3

Jaccard similarity for A and B: 1/5
Jaccard similarity for A and C: 2/4
By this measure, user C appears to share A's opinions more than user B does
Can user C provide a prediction for A?

W10.B.7 Cosine similarity (1/2)
We can treat blanks as 0 values
The cosine of the angle between A and B is
\cos(A, B) = \frac{4 \cdot 5}{\sqrt{4^2 + 5^2 + 1^2} \cdot \sqrt{5^2 + 5^2 + 4^2}} = 0.380

W10.B.8 Cosine similarity (2/2)
We can treat blanks as 0 values
The cosine of the angle between A and C is
\cos(A, C) = \frac{5 \cdot 2 + 1 \cdot 4}{\sqrt{4^2 + 5^2 + 1^2} \cdot \sqrt{2^2 + 4^2 + 5^2}} = 0.322
A is slightly closer to B than to C

W10.B.9 Normalizing ratings (1/2)
What if we normalize ratings by subtracting from each rating the average rating of that user?
Some ratings (the very low ones) will turn into negative numbers
If we then take the cosine distance, opposite views of the movies produce vectors pointing in almost opposite directions, which can be as far apart as possible

W10.B.10 Normalizing ratings (2/2)
Normalized utility matrix:

     HP1   HP2   HP3   TW    SW1   SW2   SW3
A    2/3               5/3   -7/3
B    1/3   1/3   -2/3
C                      -5/3  1/3   4/3
D          0                             0

The cosine of the angle between A and B:
\cos(A, B) = \frac{(2/3)(1/3)}{\sqrt{(2/3)^2 + (5/3)^2 + (7/3)^2} \cdot \sqrt{(1/3)^2 + (1/3)^2 + (2/3)^2}} = 0.092

W10.B.11
The cosine of the angle between A and C:
\cos(A, C) = \frac{(5/3)(-5/3) + (-7/3)(1/3)}{\sqrt{(2/3)^2 + (5/3)^2 + (7/3)^2} \cdot \sqrt{(5/3)^2 + (1/3)^2 + (4/3)^2}} = -0.559
A and C are much further apart than A and B, and neither pair is very close
A and C disagree on the two movies they rated in common, while A and B give similar scores to the one movie they rated in common
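The similarity numbers above can be reproduced with a small Python sketch; this is illustrative code (not from the slides), assuming the utility matrix is stored as per-user dictionaries with blanks simply absent:

from math import sqrt

# Utility matrix from the slides: blanks are absent keys.
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
    "D": {"HP2": 3, "SW3": 3},
}

def jaccard(u, v):
    a, b = set(u), set(v)
    return len(a & b) / len(a | b)

def cosine(u, v):
    # Blanks count as 0, so only co-rated items contribute to the dot product.
    dot = sum(u[i] * v[i] for i in u if i in v)
    return dot / (sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values())))

def normalize(u):
    # Subtract the user's average rating from each of their ratings.
    mean = sum(u.values()) / len(u)
    return {i: r - mean for i, r in u.items()}

A, B, C = ratings["A"], ratings["B"], ratings["C"]
print(jaccard(A, B), jaccard(A, C))                       # 0.2  0.5
print(round(cosine(A, B), 3), round(cosine(A, C), 3))     # 0.380  0.322
print(round(cosine(normalize(A), normalize(B)), 3))       # 0.092
print(round(cosine(normalize(A), normalize(C)), 3))       # -0.559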

W10.B.12 Computational complexity
Worst case: O(MN), where M is the number of customers and N is the number of product catalog items
It examines M customers and up to N items for each customer

W10.B.13 Computational complexity (1/3)
The average customer vector is extremely sparse!
The algorithm's performance tends to be closer to O(M + N)
Scanning every customer is O(M), not O(MN), because almost every customer has rated or purchased very few items
The few customers who have purchased or rated a significant percentage of the items require O(N) work
What about 10 million customers and 1 million items?

W10.B.14 Computational complexity (2/3)
We can reduce M by:
Randomly sampling the customers
Discarding customers with few purchases

W10.B.15 Computational complexity (3/3)
Dimensionality reduction techniques, such as clustering and principal component analysis, can reduce M or N by a large factor
We can reduce N by:
Discarding very popular or unpopular items
Partitioning the item space based on the product category or subject classification

W10.B.16 Disadvantage of space reduction
Reduced recommendation quality
Customer sampling: the most similar customers may be dropped
Item-space partitioning: restricts recommendations to a specific product or subject area
Discarding the most popular or unpopular items: they will never appear as recommendations

W10.B.17 4. Recommendation Systems
Amazon.com: Item-to-item collaborative filtering
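A rough sketch of the M/N reduction heuristics listed above, assuming the purchase history is stored as (customer, item) pairs; the thresholds and sampling rate are made-up illustrations, not values from the lecture:

import random
from collections import Counter

# Hypothetical purchase log: (customer_id, item_id) pairs.
purchases = [("c1", "i1"), ("c1", "i2"), ("c2", "i2"),
             ("c3", "i3"), ("c3", "i2"), ("c3", "i1")]

def reduce_customers(purchases, sample_rate=0.5, min_purchases=2):
    counts = Counter(c for c, _ in purchases)
    # Keep a random sample of the customers that have at least min_purchases purchases.
    keep = {c for c, n in counts.items()
            if n >= min_purchases and random.random() < sample_rate}
    return [(c, i) for c, i in purchases if c in keep]

def reduce_items(purchases, min_count=1, max_count=1000):
    counts = Counter(i for _, i in purchases)
    # Drop very unpopular and very popular items.
    keep = {i for i, n in counts.items() if min_count <= n <= max_count}
    return [(c, i) for c, i in purchases if i in keep]

print(reduce_items(reduce_customers(purchases)))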

W10.B.18
This material is based on: Greg Linden, Brent Smith, and Jeremy York, "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," IEEE Internet Computing, 2003

W10.B.19
Amazon.com uses recommendations as a targeted marketing tool
Email campaigns
Most of their web pages

W10.B.20 Item-to-item collaborative filtering
It does NOT match the user to similar customers
Item-to-item collaborative filtering:
Matches each of the user's purchased and rated items to similar items
Combines those similar items into a recommendation list

W10.B.21 Determining the most-similar match
The algorithm builds a similar-items table by finding items that customers tend to purchase together
How about building a product-to-product matrix by iterating through all item pairs and computing a similarity metric for each pair?
Many product pairs have no common customer
If you already bought a TV today, will you buy another TV again today?

W10.B.22 Co-occurrence in a product-to-product matrix
Suppose that there are four users A, B, C, and D, and five products P1, P2, P3, P4, and P5
A purchased P1, P2, P3
B purchased P2, P3, P5
C purchased P3, P4
D purchased P2, P4

W10.B.23 Co-occurrence in a product-to-product matrix

       P1  P2  P3  P4  P5
  P1    0   1   1   0   0
  P2    1   0   1   1   1
  P3    1   1   0   1   1
  P4    0   1   1   0   0
  P5    0   1   1   0   0
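The co-occurrence matrix above can be built mechanically from the purchase lists; the following short sketch (illustrative, not from the slides) records a 1 for every pair of items bought by the same customer:

from itertools import combinations

# Toy purchase data from the slides.
baskets = {
    "A": ["P1", "P2", "P3"],
    "B": ["P2", "P3", "P5"],
    "C": ["P3", "P4"],
    "D": ["P2", "P4"],
}

products = ["P1", "P2", "P3", "P4", "P5"]
# 1 if at least one customer purchased both items, 0 otherwise.
cooc = {p: {q: 0 for q in products} for p in products}
for items in baskets.values():
    for p, q in combinations(items, 2):
        cooc[p][q] = cooc[q][p] = 1

for p in products:
    print(p, [cooc[p][q] for q in products])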

W10.B.24 Similarity product-to-product matrix
Similarity between products
P1: T-shirt, P2: video cable, P3: projector, P4: TV, P5: smart TV
Assume that:
Similarity(P1, P2) = 0.001
Similarity(P1, P3) = 0.001
Similarity(P1, P4) = 0.001
Similarity(P1, P5) = 0.001
Similarity(P2, P3) = 0.013
Similarity(P2, P4) = 0.022
Similarity(P2, P5) = 0.022
Similarity(P3, P4) = 0.26
Similarity(P3, P5) = 0.26
Similarity(P4, P5) = 0.72

        P1     P2     P3     P4     P5
  P1    0      0.001  0.001  0.001  0.001
  P2    0.001  0      0.013  0.022  0.022
  P3    0.001  0.013  0      0.26   0.26
  P4    0.001  0.022  0.26   0      0.72
  P5    0.001  0.022  0.26   0.72   0

W10.B.25 Calculating the similarity between a single product and all related products:
For each item I1 in the product catalog
  For each customer C who purchased I1
    For each item I2 purchased by customer C
      Record that a customer purchased I1 and I2
  For each item I2
    Compute the similarity between I1 and I2
(A Python sketch of this loop appears after W10.B.29 below.)

W10.B.26 Computing similarity
Using the cosine measure
Each vector corresponds to an item rather than a customer
Its M dimensions correspond to the customers who have purchased that item

W10.B.27 Creating a similarity product-to-product table
Building the similar-items table is extremely compute intensive
Offline computation: O(N^2 M) in the worst case, where N is the number of items and M is the number of customers
The average case is closer to O(NM), because most customers have very few purchases
Sampling customers who purchase best-selling titles reduces runtime even more, with little reduction in quality

W10.B.28 Scalability (1/2)
Amazon.com has around 300 million customers and more than 562,382,292 cataloged items
Traditional collaborative filtering does little or no offline computation
Its online computation scales with the number of customers and catalog items

W10.B.29 Scalability (2/2)
Cluster models can perform much of the computation offline, but recommendation quality is relatively poor
Content-based models:
Cannot provide recommendations with interesting, targeted titles
Not scalable for customers with numerous purchases and ratings
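Below is a minimal Python sketch of the W10.B.25 loop, using the cosine measure over item vectors whose dimensions are customers (W10.B.26). It is an illustration under simplified assumptions (binary purchase vectors, in-memory dictionaries), not Amazon's production implementation:

from collections import defaultdict
from math import sqrt

# Hypothetical purchase history: customer -> set of purchased items.
purchases = {
    "A": {"P1", "P2", "P3"},
    "B": {"P2", "P3", "P5"},
    "C": {"P3", "P4"},
    "D": {"P2", "P4"},
}

# Invert to item -> set of customers; each item is a 0/1 vector over customer dimensions.
buyers = defaultdict(set)
for customer, items in purchases.items():
    for item in items:
        buyers[item].add(customer)

def build_similar_items_table(purchases, buyers):
    table = defaultdict(dict)
    for i1, customers in buyers.items():
        # Record, for every customer who purchased I1, the other items they purchased.
        common = defaultdict(int)
        for c in customers:
            for i2 in purchases[c]:
                if i2 != i1:
                    common[i2] += 1   # shared-customer count = dot product of 0/1 item vectors
        # Cosine similarity between I1 and each co-purchased item I2.
        for i2, n in common.items():
            table[i1][i2] = n / (sqrt(len(customers)) * sqrt(len(buyers[i2])))
    return table

table = build_similar_items_table(purchases, buyers)
print(sorted(table["P3"].items(), key=lambda kv: -kv[1]))   # items most similar to P3

Online recommendation then reduces to looking up, in this precomputed table, the items most similar to what the user has purchased or rated, as described in W10.B.30.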

W10.B.30 Key scalability strategy for Amazon recommendations
Create the expensive similar-items table offline
The online component looks up similar items for the user's purchases and ratings
It scales independently of the catalog size or the total number of customers
It depends only on how many titles the user has purchased or rated

W10.B.31 Recommendation quality
The algorithm recommends highly correlated similar items, so recommendation quality is excellent
The algorithm also performs well with limited user data

W10.B.32 5. Evaluation and Validation Techniques

W10.B.33 Why evaluation/validation? (1/2)
A process for model selection and performance estimation
Model selection (fitting the model)
Most models have one or more free parameters, e.g., h_\theta(x) = \theta_0 + \theta_1 x_1
You must find these values to generate (fit) your model
How do we select the optimal parameter(s) or model for a given classification or predictive-analytics problem?

W10.B.34 Why evaluation/validation? (2/2)
Performance estimation
Once we have chosen a model, how do we estimate its performance (accuracy)?
Performance is typically measured by the true error rate, e.g., the classifier's error rate on the entire population

W10.B.35 Challenges (1/2)
If we had access to an unlimited number of examples, these questions would have a straightforward answer:
Choose the model that provides the lowest error rate on the entire population; that error rate is, of course, the true error rate
In real applications we only have access to a subset of examples, usually smaller than we would like
What if we use all of the available data to fit our model and estimate the error rate?
The final model will normally overfit the training data
We would have already used our test data to train the model
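To make the overfitting concern concrete, here is a small illustrative sketch (not from the lecture) that fits an overly flexible model on all available data; the data-generating function and the degree-6 polynomial are arbitrary choices for the example:

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Hypothetical population: y depends linearly on x, plus noise.
    x = rng.uniform(0, 1, n)
    return x, 2.0 + 3.0 * x + rng.normal(0, 0.3, n)

x, y = sample(30)                       # all the data we have
coeffs = np.polyfit(x, y, 6)            # flexible model fit on everything
train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)

# Pretend we could query the whole population: the error there is typically higher,
# so the training error is an optimistic estimate of the true error rate.
px, py = sample(100000)
true_err = np.mean((np.polyval(coeffs, px) - py) ** 2)
print(round(float(train_err), 3), round(float(true_err), 3))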

W10.B.36 Challenges (2/2)
This problem is more pronounced with models that have a large number of parameters
The error rate estimate will be overly optimistic (lower than the true error rate)
In fact, it is not uncommon to get 100% correct classification on training data

W10.B.37 1. The Holdout Method
A much better approach is to split the data into disjoint subsets: the holdout method

W10.B.38 The Holdout Method
Split the dataset into two groups:
Training set: used to train the model
Test set: used to estimate the error rate of the trained model
[Figure: the total number of examples is split into a training set and a test set]
This is the simplest cross-validation method

W10.B.39 Drawbacks of the holdout method
For a sparse dataset, we may not be able to set aside a portion of the dataset for testing
Depending on where the split happens, the estimate of the error can be misleading: the sample might not be representative
The limitations of the holdout method can be overcome with a family of resampling methods, at the cost of more computation:
Stratified sampling
Cross-validation: random subsampling, k-fold cross-validation, leave-one-out cross-validation

W10.B.40 2. Random Subsampling

W10.B.41 Random Subsampling
Perform K data splits of the dataset
Each split randomly selects a (fixed) number of test examples without replacement
For each data split, retrain the classifier from scratch with the training examples and estimate E_i with the test examples
[Figure: K experiments, each holding out a different randomly chosen set of test examples from the total number of examples]
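A minimal sketch of the holdout split and of random subsampling, reusing the toy regression setup from the previous sketch; the split fraction and K are arbitrary illustrative choices, not values from the slides:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset (linear relationship plus noise).
x = rng.uniform(0, 1, 60)
y = 2.0 + 3.0 * x + rng.normal(0, 0.3, 60)

def holdout_error(x, y, test_fraction=0.3):
    # Randomly choose a disjoint test set, train on the rest, and report the test error.
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    coeffs = np.polyfit(x[train], y[train], 1)
    return np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)

# Random subsampling: repeat the holdout K times and average the estimates E_i.
K = 10
errors = [holdout_error(x, y) for _ in range(K)]
print(round(float(np.mean(errors)), 3))   # E = (1/K) * sum(E_i)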

W10.B.42 True Error Estimate
The true error estimate is obtained as the average of the K separate estimates E_i:
E = \frac{1}{K} \sum_{i=1}^{K} E_i
This estimate is significantly better than the holdout estimate

W10.B.43 3. k-fold Cross-Validation

W10.B.44 k-fold Cross-Validation
Create a k-fold partition of the dataset
For each of the k experiments, use k-1 folds for training and the remaining fold for testing
[Figure: k experiments, each using a different fold of the total number of examples as the test set]

W10.B.45 True error estimate
k-fold cross-validation is similar to random subsampling
The advantage of k-fold cross-validation: all the examples in the dataset are eventually used for both training and testing
The true error is estimated as the average error rate:
E = \frac{1}{K} \sum_{i=1}^{K} E_i

W10.B.46 4. Leave-one-out Cross-Validation

W10.B.47 Leave-one-out Cross-Validation
Leave-one-out is the degenerate case of k-fold cross-validation: k is chosen as the total number of examples
For a dataset with N examples, perform N experiments
Use N-1 examples for training and the remaining example for testing
[Figure: N experiments, each holding out a single test example from the total number of examples]
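A short sketch of k-fold cross-validation on the same toy setup; setting k equal to the number of examples gives leave-one-out. This is illustrative code, not from the slides:

import numpy as np

rng = np.random.default_rng(2)

x = rng.uniform(0, 1, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.3, 50)

def k_fold_error(x, y, k):
    # Partition the shuffled indices into k folds; each fold serves as the test set once.
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], 1)
        errors.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
    return float(np.mean(errors))   # E = (1/K) * sum(E_i)

print(round(k_fold_error(x, y, 10), 3))       # 10-fold cross-validation
print(round(k_fold_error(x, y, len(x)), 3))   # leave-one-out: k = N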

W10.B.48 How many folds are needed?
The choice of the number of folds depends on the size of the dataset
For large datasets, even 3-fold cross-validation will be quite accurate
For very sparse datasets, you may have to consider leave-one-out to get the maximum number of experiments
A common choice for k-fold cross-validation is k = 10

W10.B.49 Three-way data splits
If model selection and true error estimates are computed simultaneously, the data needs to be divided into three disjoint sets:
Training set: used to fit the model, e.g., to find the optimal weights
Validation set: a set of examples used to tune the parameters of a model, e.g., to find the optimal number of hidden units or determine a stopping point for the back-propagation algorithm
Test set: used only to assess the performance of a fully trained model
After assessing the final model with the test set, you must not further tune the model

W10.B.50 Classifiers

W10.B.51 Classification
[Figure: a classifier assigns examples to classes such as mammals, fishes, birds, and insects]

W10.B.52 Plain Accuracy
Classifier accuracy is a general measure of classifier performance:
Accuracy = (number of correct decisions made) / (total number of decisions made) = 1 - error rate
Pros: very easy to measure
Cons: does not reflect realistic cases, such as unbalanced classes (next slides)

W10.B.53 The Confusion Matrix
A type of contingency table: for n classes, an n x n matrix
The columns are labeled with actual classes, the rows with predicted classes
It separates out the decisions made by the classifier, showing how one class is being confused for another
Different sorts of errors may be dealt with separately

                     p (actual positive)   n (actual negative)
  Y (predicted yes)  True positive         False positive
  N (predicted no)   False negative        True negative
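A small sketch (illustrative, not from the lecture) that tabulates the four cells of a binary confusion matrix and the plain accuracy from actual and predicted labels:

def confusion_matrix(actual, predicted):
    # 2x2 contingency table for a binary classifier:
    # rows = predicted Y/N, columns = actual positive/negative.
    tp = sum(1 for a, p in zip(actual, predicted) if a == "p" and p == "Y")
    fp = sum(1 for a, p in zip(actual, predicted) if a == "n" and p == "Y")
    fn = sum(1 for a, p in zip(actual, predicted) if a == "p" and p == "N")
    tn = sum(1 for a, p in zip(actual, predicted) if a == "n" and p == "N")
    return tp, fp, fn, tn

# Hypothetical labels for ten examples.
actual    = ["p", "p", "p", "n", "n", "n", "n", "n", "n", "n"]
predicted = ["Y", "Y", "N", "Y", "N", "N", "N", "N", "N", "N"]

tp, fp, fn, tn = confusion_matrix(actual, predicted)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, accuracy)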

W10.B.54 Problems with Unbalanced Classes
Consider a classification problem where one class is rare
Sifting through a large population of normal entities to find a relatively small number of unusual ones, e.g., looking for defrauded customers or defective parts
The class distribution is unbalanced (skewed)

W10.B.55 Why accuracy is misleading
[Figure: a balanced population (50% positive, 50% negative) versus the true population (10% positive, 90% negative)]
Confusion Matrix of A, evaluated with 1,000 data points:

        p    n
  Y   500  200
  N     0  300

Confusion Matrix of B, evaluated with 1,000 data points:

        p    n
  Y   300    0
  N   200  500

Which model is better?

W10.B.56 F-measure (F1 score)
Summarizes the confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN)

        p    n
  Y    TP   FP
  N    FN   TN

True positive rate (sensitivity) = TP / (TP + FN)
False negative rate (miss rate) = FN / (TP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 * (precision * recall) / (precision + recall)
Accuracy = (TP + TN) / (P + N)

W10.B.57 Questions?
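Computing these metrics for the two confusion matrices above shows why plain accuracy is misleading: models A and B have the same accuracy (0.80) but very different precision, recall, and F-measure. A minimal sketch:

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Confusion matrices of models A and B from the slide, as (TP, FP, FN, TN).
for name, cm in {"A": (500, 200, 0, 300), "B": (300, 0, 200, 500)}.items():
    acc, prec, rec, f1 = metrics(*cm)
    print(name, round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))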