# By Atul S. Kulkarni Graduate Student, University of Minnesota Duluth. Under The Guidance of Dr. Richard Maclin


## Transcription

1 By Atul S. Kulkarni Graduate Student, University of Minnesota Duluth Under The Guidance of Dr. Richard Maclin

2 Outline Problem Statement Background Proposed Solution Experiments & Results Related Work Future Work Conclusion Q & A

3

4 Problem Statement Given a set of users with their previous ratings for a set of movies, can we predict the rating they will assign to a movie they have not previously rated? Netflix puts it as: "The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love." So what do they want? A 10% improvement over their existing system. They are paying $1 Million for this.

5 Problem Statement Similarly, which movie will you like, given that you have seen X-Men, X-Men II, and X-Men: The Last Stand, and users who saw these movies also liked X-Men Origins: Wolverine? Answer: ?

6 Dataset Background for the problem Background for the Solution

7 Background - Dataset Netflix Prize Dataset: Netflix released data for this competition. It contains nearly 100 million ratings. Number of (anonymous) users = 480,189. Number of movies rated by them = 17,770. Training data is provided per movie. To verify the developed model without submitting the predictions to Netflix, probe.txt is provided. To submit predictions for the competition, qualifying.txt is used.

8 Background - Dataset Data in the training file is per movie. It looks like this:

    MovieID:
    CustomerID,Rating,DateOfRating
    CustomerID,Rating,DateOfRating
    CustomerID,Rating,DateOfRating

Example (movie 4; the customer IDs and dates did not survive extraction):

    4:
    ...,3,...
    ...,1,...
    ...,5,...
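
The per-movie layout above can be parsed with a short sketch (the function name and the sample customer IDs/dates in the usage below are illustrative, not taken from the Netflix files):

```python
def parse_training_file(lines):
    # Parse Netflix-style training lines: a "MovieID:" header line
    # followed by "CustomerID,Rating,Date" rows for that movie.
    # Returns {movie_id: {customer_id: rating}}.
    ratings, movie = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):          # header line starts a new movie block
            movie = int(line[:-1])
            ratings[movie] = {}
        else:
            cust, rating, _date = line.split(",")
            ratings[movie][int(cust)] = int(rating)
    return ratings
```

For example, feeding it the four lines `"4:"`, `"123,3,2005-09-06"`, `"456,1,2005-05-13"`, `"789,5,2005-10-19"` (made-up IDs and dates) yields `{4: {123: 3, 456: 1, 789: 5}}`.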

9 Background - Dataset Data points in probe.txt look like this (we have the answers):

    MovieID:
    CustomerID
    CustomerID

Data in qualifying.txt looks like this (no answers):

    MovieID:
    CustomerID,DateOfRating
    CustomerID,DateOfRating

10 Background Dataset stats Total ratings possible = 480,189 (users) × 17,770 (movies) ≈ 8.5 billion. Total available = 100 million. The Users × Movies matrix therefore has about 8.4 billion entries missing: sparse data.

11 Background of the problem Recommender systems. Examples: Yahoo, Google, YouTube, Amazon. They recommend items that you might like; the recommendation is made based on past behavior. Collaborative Filtering [Gábor, 2009]: What is it? Who collaborates, and what is filtered? How can it be applied in this contest?

12 Background of the problem Earlier systems were implemented in the 1990s: GroupLens (Usenet articles) [Resnick, 1997], Siteseer (cross-linking technical papers) [Resnick, 1997], Tapestry (email filtering) [Goldberg, 1992]. These earlier solutions provided for users to rate the items. Two major divisions of methods: Model based, which fit a model to the training data, and Memory based, e.g. nearest neighbor methods.

13 Background for the Solution K-Nearest Neighbor (K-NN) method: a memory-based method. It measures the distance between the query instance and every instance in the training set, finds the K training instances with the least distance from the query instance, and, using these K instances, averages their ratings for the movie. Distances can be measured using the following formulae.

14 Background for the Solution Distance formulae, for instances $x_i$ and $x_j$ with features indexed by $f$:

Manhattan distance: $\sum_f |x_{i,f} - x_{j,f}|$

Euclidean distance: $\sqrt{\sum_f (x_{i,f} - x_{j,f})^2}$

Minkowsky distance: $\left( \sum_f |x_{i,f} - x_{j,f}|^p \right)^{1/p}$

Mahalanobis distance: $\sqrt{(x_i - x_j)^T \Sigma^{-1} (x_i - x_j)}$, where $\Sigma$ is the covariance matrix of the data.
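
The first three formulae can be sketched in plain Python (the function names are mine; a real implementation over the Netflix data would compute these over the movies two users have both rated):

```python
import math

def manhattan(x, y):
    # Sum of absolute per-feature differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # Square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def minkowsky(x, y, p):
    # Generalizes Manhattan (p=1) and Euclidean (p=2).
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

For instance, `manhattan([1, 2], [4, 6])` gives 7 while `euclidean([0, 0], [3, 4])` gives 5.0, and `minkowsky(x, y, 2)` agrees with `euclidean(x, y)`.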

15 Background for the Solution How important is the distance measure? The curse of dimensionality. Example: what if we were to characterize a movie by its actors, directors, writers, genre, and then all of its crew? What is the problem? What if some attributes are more dominant than others? Example: the cost of a home is a much larger quantity than a person's height.

16 Background of the Solution What if I am very conservative about my ratings and someone else is too generous? I rate the movie I like the most as 3 and the least as 1; someone else rates his/her highest at 5 and lowest at 3. So am I like this person? Difficult to say. We are comparing two people with very strong personal biases, which will result in an obviously flawed similarity measure. Solution? Normalization of the data.

17 Background for the Solution Normalization. What is that? How do we do it? How will it change my ratings? Won't I lose the original rating? We will calculate the mean rating for every user over the movies he/she has rated, and also calculate the standard deviation of the user's ratings. From every rating we subtract the user's mean rating and divide by their standard deviation.
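
A minimal sketch of this per-user normalization and its inverse; using the population standard deviation (the slides do not say sample vs. population) and guarding against a user whose ratings never vary are my assumptions:

```python
import statistics

def normalize_user(ratings):
    # z-score a single user's ratings: subtract their mean rating,
    # divide by their standard deviation.
    mu = statistics.mean(ratings)
    sigma = statistics.pstdev(ratings) or 1.0  # guard: constant rater => std 0
    return [(r - mu) / sigma for r in ratings], mu, sigma

def denormalize(z, mu, sigma):
    # Invert the transform to recover a rating on the original 1-5 scale.
    return z * sigma + mu
```

So the original ratings are not lost: any normalized value can be mapped back with the stored per-user mean and standard deviation.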

18 Background for the Solution Should all members of the neighborhood contribute equally to the prediction? Not always; we can argue that people who are similar to you, i.e., at the least distance from you, should contribute more than farther ones. This is done by weighting each contribution by the instance's distance from the query instance.
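
A sketch of such a distance-weighted average; the slides do not fix the exact weighting function, so inverse distance is an assumption here:

```python
def weighted_prediction(neighbors):
    # neighbors: list of (distance, rating) pairs for the K nearest users.
    # Closer neighbors get larger weights via inverse distance.
    eps = 1e-9  # avoid division by zero for a neighbor at distance 0
    weights = [1.0 / (d + eps) for d, _ in neighbors]
    ratings = [r for _, r in neighbors]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)
```

A neighbor at distance 1 thus counts three times as much as one at distance 3.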

19 Background for the Solution Clustering. The idea is to group items together based on their attributes. Data is typically unlabeled. Similarity is measured using the distance between two points. Example: consider going into a comic book shop and putting together, from a pile of comics, the ones that are similar. Types: Partitional clustering: K-Means. Hierarchical clustering: Agglomerative clustering.

20 Background for the Solution K-Means clustering [MacQueen, 1967]: Randomly select K instances as cluster centers. Label every data point with its nearest cluster center. Re-compute the cluster centers. Repeat the last two steps until no instances change clusters or a certain number of iterations has passed. How is it related to our discussion today?
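
The steps above can be sketched as follows (plain Python, squared Euclidean distance; illustrative only):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # Randomly select k instances as the initial cluster centers.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: label every point with its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: re-compute each center as the mean of its members
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters
```

On well-separated data such as `[(0, 0), (0, 1), (10, 10), (10, 11)]` with k=2, the centers settle at (0, 0.5) and (10, 10.5).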

21 K Nearest Neighbor Algorithm Clustering Based Nearest Neighbor Algorithm

22 Proposed Solution K-Nearest Neighbor approach (overview). Given a query instance q(MovieId, UserId), normalize the data before processing. Find the distance of this instance to all the users who rated this movie. Of these users, select the K nearest to the query instance as its neighborhood. Average the ratings of the users from this neighborhood for this particular movie. This is the predicted rating for the query instance.
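
A compact sketch of the whole pipeline (normalize, measure distance over co-rated movies, pick the K nearest, average, denormalize); the dict layout, the Euclidean distance choice, and the population standard deviation are assumptions for illustration:

```python
import math
import statistics

def predict_knn(query_user, movie, ratings, k):
    # ratings: dict user -> dict movie -> rating (1-5). Illustrative layout.
    def stats(u):
        vals = list(ratings[u].values())
        return statistics.mean(vals), (statistics.pstdev(vals) or 1.0)

    q_mu, q_sd = stats(query_user)
    # Only users who rated the target movie are candidate neighbors.
    candidates = [u for u in ratings if movie in ratings[u] and u != query_user]

    def dist(u):
        # Euclidean distance over co-rated movies, in normalized space.
        u_mu, u_sd = stats(u)
        common = set(ratings[u]) & set(ratings[query_user])
        return math.sqrt(sum(((ratings[query_user][m] - q_mu) / q_sd
                              - (ratings[u][m] - u_mu) / u_sd) ** 2
                             for m in common))

    neighbors = sorted(candidates, key=dist)[:k]
    # Average the neighbors' normalized ratings, then map back to the
    # query user's own rating scale.
    z = statistics.mean((ratings[u][movie] - stats(u)[0]) / stats(u)[1]
                        for u in neighbors)
    return z * q_sd + q_mu
```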

23 Proposed Solution - Example Example (representative data, not real): a users × movies ratings matrix. Movies: Matrix, Star Wars, Dark Knight, Rocky, Sita Aur Gita, Star Trek, Cliffhanger, A.I., MI, X-Men. Users: Jim, Sean, John, Sidd, Penny, Pete. Pete's row includes the ratings 5, ?, 4, 4, with "?" marking the rating to predict.

24 Proposed Solution - Example Calculate the mean rating and standard deviation vectors: one mean and one standard deviation per user (Jim, Sean, John, Sidd, Penny, Pete).

25 Proposed Solution - Example Normalized data: the same users × movies matrix with each rating replaced by (rating − user mean) / user standard deviation. Pete's rating of 5 normalizes to 1.15, and his rating for Sita Aur Gita remains "?".

26 Proposed Solution - Example So now we have a query instance q(Pete, Sita Aur Gita), i.e., we wish to evaluate how much Pete will like the movie Sita Aur Gita on a scale of 1-5. To do this we need to identify Pete's two neighbors who rated this movie (the 2-NN case). The users who rated the movie Sita Aur Gita (candidate_users) are Jim, Sidd, and Penny.

27 Proposed Solution - Example Computing the distance of each candidate user (Jim, Sidd, Penny) from Pete, the two nearest neighbors are Jim and Sidd.

28 Proposed Solution - Example The average of the normalized ratings by Jim and Sidd for the movie Sita Aur Gita gives our raw prediction. So is the prediction correct? Not yet: this prediction is in normalized form, and we need to bring it back to Pete's rating scale. How? Multiply by the standard deviation of Pete's ratings and add Pete's mean rating to this product: predicted rating = (normalized average × σ_Pete) + μ_Pete. This gives the predicted rating for Pete.

29 Proposed Solution C-K-NN Clustering-based Nearest Neighbor approach. Obtain for every movie its genres from external sources (IMDb in our case). Create for every user a vector with one cell per genre; in each cell we count the number of movies the user has rated from that genre. (We have one such vector for each user.) Cluster the users according to the genres of the movies they have rated. The cluster centers of these clusters represent the collective opinion of the users in that cluster about the movies of that particular genre. We call them Super Users.
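
Building the per-user genre-count vectors might look like this (function and variable names are illustrative, and the toy data in the usage note is invented):

```python
def genre_vectors(user_ratings, movie_genres, genres):
    # One vector per user: cell g counts how many movies of genre g
    # the user has rated.
    vectors = {}
    for user, movies in user_ratings.items():
        vec = [0] * len(genres)
        for m in movies:
            for g in movie_genres.get(m, []):
                vec[genres.index(g)] += 1
        vectors[user] = vec
    return vectors
```

For example, with `genres = ["Action", "Drama"]`, a user who rated one pure Action movie and one Action/Drama movie gets the vector `[2, 1]`.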

30 Proposed Solution C-K-NN For each super user, we predict the rating of all the movies of that genre as the average of the ratings of the users who rated the movie. When presented with a query point q(MovieId, UserId), we find all the genres of that movie. For each genre, we calculate the distance of the user from the cluster centers for that genre. We select the nearest K cluster centers and average their ratings for the movie to predict the movie rating for this genre. We then average the per-genre predicted ratings to get the predicted rating for q.
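
A sketch of this prediction step; the data layout and the rule for skipping a genre with no cluster information for the user are assumptions on my part:

```python
import math

def predict_cknn(user_vec, query_genres, centers, center_ratings, k=1):
    # centers: genre -> list of cluster-center vectors ("super users").
    # center_ratings: genre -> that center's predicted rating for the movie.
    per_genre = []
    for g in query_genres:
        if g not in centers:
            continue  # no cluster information for this genre: skip it
        # Rank this genre's cluster centers by distance to the user's vector.
        order = sorted(range(len(centers[g])),
                       key=lambda i: math.dist(user_vec, centers[g][i]))
        nearest = order[:k]
        per_genre.append(sum(center_ratings[g][i] for i in nearest) / len(nearest))
    # Average the per-genre predictions into one rating for the query.
    return sum(per_genre) / len(per_genre)
```

With one Adventure cluster rated 4.0 and another rated 2.0, a user sitting on the second center is predicted 2.0, matching the spirit of the worked example that follows.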

31 Proposed Solution Example (C-K-NN) We use the data from our previous example (recap of the users × movies ratings matrix: users Jim, Sean, John, Sidd, Penny, Pete; movies Matrix through X-Men; Pete's rating for Sita Aur Gita is "?").

32 Proposed Solution Example (C-K-NN) We find the genres for every movie: a movie × genre indicator table over Action, Adventure, Crime, Drama, Fantasy, Sci-Fi, Sport, Thriller, with a 1 wherever a movie belongs to a genre. For instance, Rocky, Sita Aur Gita, and Star Trek each carry two genres, and X-Men carries three.

33 Proposed Solution Example (C-K-NN) Convert the user × movie data to user × genre counts: one row per user (Jim, Sean, John, Sidd, Penny, Pete), one column per genre (Action, Adventure, Crime, Drama, Fantasy, Sci-Fi, Sport, Thriller).

34 Proposed Solution Example (C-K-NN) We cluster the users into two clusters over the same user × genre table (users: Jim, Sean, John, Sidd, Penny, Pete; genres: Action, Adventure, Crime, Drama, Fantasy, Sci-Fi, Sport, Thriller).

35 Proposed Solution Example (C-K-NN) The query point, as last time, is q(Pete, Sita Aur Gita). The per-genre clusters for the genres of Sita Aur Gita (Adventure and Drama) list, for each of the two clusters, the member movies (among Matrix, Star Wars, Sita Aur Gita, Star Trek, Cliffhanger, A.I., MI) with their cluster ratings.

36 Proposed Solution Example (C-K-NN) We compute the distance of Pete from the cluster centers of Adventure (clusters 1 and 2). The distance of Pete from the cluster centers of Drama is not applicable, as Pete has not rated any movie from that genre. We try to find the one (K=1) nearest cluster for the Adventure genre: that is cluster two.

37 Proposed Solution Example (C-K-NN) Hence, the rating for the query point q(Pete, Sita Aur Gita) is calculated by taking the rating of cluster two of the Adventure genre. Our prediction is 2 for this movie. What if Pete had rated one of the movies from the Drama genre? We would predict the rating for the Drama genre for Pete as well, and then average the predicted ratings for the two genres to get the final rating.

38

39 Experiments Setup Dataset used: the Netflix Prize dataset. Experiments were performed on 1121 randomly selected movies and the users who rated them. These data instances were chosen from the probe file of the Netflix dataset, so we have the ratings for these instances in the training data. These instances are treated as the hold-out set in the experiments.

40 Experiments Setup We normalize the data for the K-NN method; predictions so made are then converted back to the denormalized form. We test the same set of (movie, user) pairs on both methods: standard K-Nearest Neighbor and Clustered K-Nearest Neighbor.

41 Experiments - Setup This is a regression problem, hence we want to know: if we are off from the expected value, how far off are we? Hence, the test metrics used are Root Mean Square Error (RMSE), Absolute Average Error (AAE), and time taken.
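
Both error metrics are straightforward to compute over the hold-out set; a minimal sketch:

```python
import math

def rmse(predicted, actual):
    # Root Mean Square Error: large misses are penalized quadratically.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

def aae(predicted, actual):
    # Absolute Average Error: the mean of |prediction - truth|.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

On the same errors, RMSE is always at least as large as AAE, and the gap grows with a few big misses.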

42 Experiments - Implementation K-NN Implemented in C/C++ (classes converted to structures). It is difficult to manage the massive dataset in memory, and the size of the program makes it difficult to run in C++. Comparison against every user needs a lot of fine tuning of the code to achieve reasonable performance (K-NN's inherent problem). Ease of implementation vs. speed is an important trade-off: using maps and vectors only adds storage overhead, and any speed gained is negated by this.

43 Experiments - Implementation C-K-NN Implemented using Perl, MATLAB, Python, and MySQL. Perl's hashes of hashes came to the rescue; its ease of token/string processing was most helpful, and the complex logic is easy to express in Perl (regexes help). Python interfaces with IMDb (IMDbPY); MySQL holds a local database of IMDb. MATLAB does the clustering (K-Means). Fine tuning of the algorithm and ample available memory negate the slow, interpreted nature of these languages.

44 Experiments - Results Results on the described dataset:

| Method | Absolute Average Error | Root Mean Square Error | Time (Minutes) |
|---|---|---|---|
| K-NN | | | * |
| C-K-NN | | | |
| Netflix (Leaderboard Topper) | NA | | NA |
| Netflix Current System | NA | | NA |

45 Experiments - Results [Charts: comparison of RMSE and Absolute Average Error for K-NN, C-K-NN, Netflix (current topper), and Netflix (current system); and time taken in minutes for K-NN and C-K-NN.]

46 Experiments - Results [Histograms: distribution of the absolute error, as the number of movies at each error level, for the standard K-NN and C-K-NN methods.]

47

48 Related Work Methods already applied to this problem include: Matrix factorization methods; Regularized Singular Value Decomposition [Paterek, 2007][Webb, 2007]; Biases with Regularized SVD [Paterek, 2007]; Probabilistic Latent Semantic Analysis (pLSA) [Hofmann, 2004]; Nearest Neighbor methods [Bell and Koren, 2007]; Alternating Least Squares [Bell and Koren, 2007]; Post-processing of SVD features [Paterek, 2007].

49

50 Future Work K-NN method: Different values of K could be experimented with. Distributed processing of this problem. Distance-weighting the contributions from neighbors. C-K-NN: Trying different numbers of clusters. The dates provided with the ratings could be used in clustering along with genre. More information from IMDb or other sources might be included. Application of movie clustering, and then predicting the ratings for users, is also possible.

51

52 Conclusions We presented results of two methods to solve the Netflix Prize problem, including a novel clustering-based method. The first method, a standard K-Nearest Neighbor method, although it gets a lower RMSE value, is very slow at prediction, a function of comparing against every user who rated the movie. The second method clusters the users based on the genres of the movies they rated and creates super users from these clusters.

53 Conclusions The standard K-NN method performs slightly better than the clustering-based method on the Root Mean Square Error metric but is extremely slow. Our clustering-based method has a higher Root Mean Square Error than the standard K-NN method but is extremely fast and practical for large-scale implementations. It also shows promise of being accurate for many predictions.

54

55 Atul S Kulkarni


Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,

### Recommender Systems using Graph Theory

Recommender Systems using Graph Theory Vishal Venkatraman * School of Computing Science and Engineering vishal2010@vit.ac.in Swapnil Vijay School of Computing Science and Engineering swapnil2010@vit.ac.in

### Clustering k-mean clustering

Clustering k-mean clustering Genome 373 Genomic Informatics Elhanan Borenstein The clustering problem: partition genes into distinct sets with high homogeneity and high separation Clustering (unsupervised)

### Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

### Creating a Recommender System. An Elasticsearch & Apache Spark approach

Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused

### Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

### Chapter 4: Non-Parametric Techniques

Chapter 4: Non-Parametric Techniques Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Supervised Learning How to fit a density

### Jarek Szlichta

Jarek Szlichta http://data.science.uoit.ca/ Approximate terminology, though there is some overlap: Data(base) operations Executing specific operations or queries over data Data mining Looking for patterns

### Unit 8 Algebra 1. Name:

Unit 8 Algebra 1 Name: Concepts: Correlations Graphing Scatter Plots Best-Fitting Liens [calculator key strokes] 4.5 Correlation and Best-Fitting Lines Correlations We use to tell if there is a relationship

### Data Preprocessing. Javier Béjar AMLT /2017 CS - MAI. (CS - MAI) Data Preprocessing AMLT / / 71 BY: \$\

Data Preprocessing S - MAI AMLT - 2016/2017 (S - MAI) Data Preprocessing AMLT - 2016/2017 1 / 71 Outline 1 Introduction Data Representation 2 Data Preprocessing Outliers Missing Values Normalization Discretization

### How to use FSBforecast Excel add in for regression analysis

How to use FSBforecast Excel add in for regression analysis FSBforecast is an Excel add in for data analysis and regression that was developed here at the Fuqua School of Business over the last 3 years

### CS570: Introduction to Data Mining

CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

### Clustering Results. Result List Example. Clustering Results. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

### Allstate Insurance Claims Severity: A Machine Learning Approach

Allstate Insurance Claims Severity: A Machine Learning Approach Rajeeva Gaur SUNet ID: rajeevag Jeff Pickelman SUNet ID: pattern Hongyi Wang SUNet ID: hongyiw I. INTRODUCTION The insurance industry has

### Topic 7 Machine learning

CSE 103: Probability and statistics Winter 2010 Topic 7 Machine learning 7.1 Nearest neighbor classification 7.1.1 Digit recognition Countless pieces of mail pass through the postal service daily. A key

### CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start

### Latent Semantic Indexing

Latent Semantic Indexing Thanks to Ian Soboroff Information Retrieval 1 Issues: Vector Space Model Assumes terms are independent Some terms are likely to appear together synonyms, related words spelling

### ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

### Semi-Automatic Transcription Tool for Ancient Manuscripts

The Venice Atlas A Digital Humanities atlas project by DH101 EPFL Students Semi-Automatic Transcription Tool for Ancient Manuscripts In this article, we investigate various techniques from the fields of

### Slides based on those in:

Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 2 y 0.8 ½+0.2 ⅓ M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N]

### Lesson 3. Prof. Enza Messina

Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

### Methods for Intelligent Systems

Methods for Intelligent Systems Lecture Notes on Clustering (II) Davide Eynard eynard@elet.polimi.it Department of Electronics and Information Politecnico di Milano Davide Eynard - Lecture Notes on Clustering

### Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

### MATH36032 Problem Solving by Computer. Data Science

MATH36032 Problem Solving by Computer Data Science NO. of jobs on jobsite 1 10000 NO. of Jobs 8000 6000 4000 2000 MATLAB Data Data Science 0 Jan 2016 Jul 2016 Jan 2017 1 http://www.jobsite.co.uk/ What

### Being Prepared In A Sparse World: The Case of KNN Graph Construction. Antoine Boutet DRIM LIRIS, Lyon

Being Prepared In A Sparse World: The Case of KNN Graph Construction Antoine Boutet DRIM LIRIS, Lyon Co-authors Joint work with François Taiani Nupur Mittal Anne-Marie Kermarrec Published at ICDE 2016

### CHAPTER 2 DESCRIPTIVE STATISTICS

CHAPTER 2 DESCRIPTIVE STATISTICS 1. Stem-and-Leaf Graphs, Line Graphs, and Bar Graphs The distribution of data is how the data is spread or distributed over the range of the data values. This is one of

### Application of Dimensionality Reduction in Recommender System -- A Case Study

Application of Dimensionality Reduction in Recommender System -- A Case Study Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John T. Riedl Department of Computer Science and Engineering / Army HPC