By Atul S. Kulkarni Graduate Student, University of Minnesota Duluth. Under The Guidance of Dr. Richard Maclin


1 By Atul S. Kulkarni Graduate Student, University of Minnesota Duluth Under The Guidance of Dr. Richard Maclin
2 Outline: Problem Statement, Background, Proposed Solution, Experiments & Results, Related Work, Future Work, Conclusion, Q & A
4 Problem Statement: Given a set of users with their previous ratings for a set of movies, can we predict the rating they will assign to a movie they have not previously rated? Netflix puts it as: "The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love." So what do they want? A 10% improvement over their existing system. They are paying $1 million for this.
5 Problem Statement: Similarly, which movie will you like, given that you have seen X-Men, X-Men II, and X-Men: The Last Stand, and that users who saw these movies also liked X-Men Origins: Wolverine? Answer: ?
6 Background: Dataset; Background for the Problem; Background for the Solution
7 Background - Dataset: Netflix Prize dataset. Netflix released data for this competition. It contains nearly 100 million ratings. Number of (anonymous) users = 480,189. Number of movies rated by them = 17,770. Training data is provided per movie. probe.txt is provided to verify the developed model without submitting predictions to Netflix. qualifying.txt is used to submit predictions for the competition.
8 Background - Dataset: Data in the training file is stored per movie. It looks like this: a "Movie#:" header line, followed by one "Customer#,Rating,Date of Rating" line per rating. Example: 4: ,3, ,1, ,5,
9 Background - Dataset: Data points in probe.txt look like this (answers are available): a "Movie#:" header line, followed by one "Customer#" line per data point. Data in qualifying.txt looks like this (no answers): a "Movie#:" header line, followed by one "Customer#,DateOfRating" line per data point. Example: 1: : , , ,
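The per-movie layout described above can be read with a few lines of code. The following is a minimal sketch, assuming each file starts with a `Movie#:` header followed by `Customer#,Rating,Date` lines; the function name `parse_movie_file` is hypothetical, and the customer numbers and dates in `sample` are invented for illustration (the slide's own example lost them in transcription).

```python
from typing import Iterable, Iterator, Tuple


def parse_movie_file(lines: Iterable[str]) -> Iterator[Tuple[int, int, int, str]]:
    """Yield (movie_id, customer_id, rating, date) tuples from per-movie lines."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Header line such as "4:" starts a new movie block.
            movie_id = int(line[:-1])
            continue
        if movie_id is None:
            raise ValueError("rating line seen before any 'Movie#:' header")
        customer, rating, date = line.split(",")
        yield movie_id, int(customer), int(rating), date


# Invented sample: movie 4 with ratings 3, 1 and 5 (IDs and dates are made up).
sample = ["4:", "1001,3,2005-01-01", "1002,1,2005-02-03", "1003,5,2005-03-04"]
rows = list(parse_movie_file(sample))
```

Streaming the rows as a generator avoids holding a whole file in memory, which matters for a dataset of this size.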
10 Background - Dataset stats: Total ratings possible = 480,189 (users) × 17,770 (movies) = 8,532,958,530 (about 8.5 billion). Total available = about 100 million. The User × Movies matrix therefore has about 8.4 billion entries missing. The data is sparse.
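The sparsity arithmetic on this slide can be checked directly. This is a small sketch that rounds the number of available ratings to 100 million, as the slide does; the variable names are illustrative only.

```python
users = 480_189
movies = 17_770
known = 100_000_000  # "nearly 100 million" ratings, rounded as on the slide

possible = users * movies      # every (user, movie) cell in the matrix
missing = possible - known     # cells with no rating
density = known / possible     # fraction of cells that are filled

print(f"possible={possible:,} missing={missing:,} density={density:.2%}")
```

Only a little over one percent of the matrix is filled, which is why the slides call the data sparse.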
11 Background of the Problem: Recommender systems. Examples: Yahoo, Google, YouTube, Amazon. They recommend items that you might like. The recommendation is made based on past behavior. Collaborative Filtering [Gábor, 2009]: What is it? Who collaborates and what is filtered? How can it be applied in this contest?
12 Background of the Problem: Earlier systems were implemented in the 1990s: GroupLens (Usenet articles) [Resnick, 1997], Siteseer (cross-linking technical papers) [Resnick, 1997], Tapestry (e-mail filtering) [Goldberg, 1992]. Earlier solutions let users rate the items. There are two major divisions of methods. Model-based: fit a model to the training data. Memory-based: nearest neighbor methods.
13 Background for the Solution: K-Nearest Neighbor (KNN) method. A memory-based method. Measure the distance between the query instance and every instance in the training set. Find the K training instances with the least distance from the query instance. Average the ratings that these K instances gave to the movie in question. Distances can be measured using the following formulae.
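14 Background for the Solution: Distance formulae (sums run over the features f of instances x_i and x_j).
Manhattan distance: $d(x_i, x_j) = \sum_{f=1}^{features} |x_{i,f} - x_{j,f}|$
Euclidean distance: $d(x_i, x_j) = \sqrt{\sum_{f=1}^{features} (x_{i,f} - x_{j,f})^2}$
Minkowski distance: $d(x_i, x_j) = \left( \sum_{f=1}^{features} |x_{i,f} - x_{j,f}|^p \right)^{1/p}$
Mahalanobis distance: $d(x_i, x_j) = \sqrt{(x_i - x_j)\, \Sigma^{-1}\, (x_i - x_j)^T}$, where $\Sigma$ is the covariance matrix of the features.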
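The first three distances are special cases of one formula, which a short sketch makes visible. The function names below are illustrative, not from the slides; Mahalanobis distance is left out because it needs the inverse of a covariance matrix, which goes beyond a few lines of standard-library code.

```python
from typing import Sequence


def minkowski(a: Sequence[float], b: Sequence[float], p: float) -> float:
    """Minkowski distance of order p between two equal-length feature vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Manhattan distance is Minkowski with p = 1."""
    return minkowski(a, b, 1)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance is Minkowski with p = 2."""
    return minkowski(a, b, 2)


d1 = manhattan([0, 0], [3, 4])  # |3| + |4|
d2 = euclidean([0, 0], [3, 4])  # sqrt(9 + 16)
```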
15 Background for the Solution: How important is the distance measure? Curse of dimensionality. Example: what if we were to characterize a movie by its actors, directors, writers, genre, and then all of its crew? What is the problem? What if some attributes are more dominant than others? Example: the cost of a home is a much larger quantity than a person's height.
16 Background for the Solution: What if I am very conservative with my ratings and someone else is too generous? I rate the movie I like the most as 3 and the least as 1. Someone else rates their highest at 5 and their lowest at 3. So am I like this person? Difficult to say. We are comparing two people with very strong personal biases, which will obviously result in a flawed similarity measure. Solution? Normalization of the data.
17 Background for the Solution: Normalization. What is that? How do we do it? How will it change my ratings? Won't I lose the original rating? We calculate the mean rating for every user over the movies he or she has rated. We also calculate the standard deviation of the user's ratings. From every rating we subtract the user's mean rating and divide the result by the user's standard deviation.
18 Background for the Solution: Should all members of the neighborhood contribute equally to the prediction? Not always. We can argue that people who are similar to you, i.e. have the least distance from you, should contribute more than those farther away. This is done by weighting each neighbor's contribution to the prediction by that instance's distance from the query instance.
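The normalization step can be sketched as follows. The slides do not say whether the population or the sample standard deviation is used; this sketch assumes the population form (`statistics.pstdev`), and the helper name `normalize_user` and the two example users are invented for illustration.

```python
import statistics
from typing import Dict, Tuple


def normalize_user(ratings: Dict[str, float]) -> Tuple[Dict[str, float], float, float]:
    """Return (z-scored ratings, mean, std) for one user's ratings."""
    values = list(ratings.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # A user who gives every movie the same rating carries no preference signal.
        return {movie: 0.0 for movie in ratings}, mean, std
    return {movie: (r - mean) / std for movie, r in ratings.items()}, mean, std


# A conservative rater (1..3) and a generous rater (3..5) with the same tastes.
conservative, c_mean, c_std = normalize_user({"A": 1, "B": 2, "C": 3})
generous, g_mean, g_std = normalize_user({"A": 3, "B": 4, "C": 5})
```

After normalization both users have the same values, so their personal bias no longer distorts the distance between them. The original rating is not lost: it is recovered as `z * std + mean`.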
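One common way to realize this weighting is inverse-distance weighting, sketched below. The slide only says that contributions are weighted by distance; the `1 / distance` form, the `eps` guard, and the function name are assumptions made for illustration.

```python
from typing import List, Tuple


def weighted_prediction(neighbors: List[Tuple[float, float]], eps: float = 1e-9) -> float:
    """Average neighbor ratings, weighting each by 1 / (distance + eps).

    neighbors holds (distance, rating) pairs; eps avoids division by zero
    when a neighbor sits exactly on the query instance.
    """
    weights = [1.0 / (dist + eps) for dist, _ in neighbors]
    total = sum(w * rating for w, (_, rating) in zip(weights, neighbors))
    return total / sum(weights)


# The neighbor at distance 1 (rating 4) counts three times as much
# as the neighbor at distance 3 (rating 2).
prediction = weighted_prediction([(1.0, 4.0), (3.0, 2.0)])
```

The plain average of the two ratings would be 3.0; the weighted result is pulled toward the closer neighbor's rating.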
19 Background for the Solution: Clustering. The idea is to group items together based on their attributes. Data is typically unlabeled. Similarity is measured using the distance between two points. Example: consider going into a comic book shop and, from a pile of comics, putting together those that are similar. Types: partitional clustering (K-Means) and hierarchical clustering (agglomerative clustering).
20 Background for the Solution: K-Means clustering [MacQueen, 1967]. Randomly select K instances as cluster centers. Label every data point with its nearest cluster center. Recompute the cluster centers. Repeat the above two steps until no instance changes cluster or a set number of iterations has gone by. How is it related to our discussion today?
21 K-Nearest Neighbor Algorithm; Clustering-Based Nearest Neighbor Algorithm
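The K-Means steps listed above can be sketched in plain Python. This is a minimal illustration using squared Euclidean distance and a seeded random start, not the Matlab routine used in the experiments; the function name and the four sample points are invented.

```python
import random
from typing import List, Sequence, Tuple


def kmeans(points: List[Sequence[float]], k: int, max_iter: int = 100,
           seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Return (labels, centers) after running the basic K-Means loop."""
    rng = random.Random(seed)
    # Step 1: randomly select K instances as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels: List[int] = []
    for _ in range(max_iter):
        # Step 2: label every point with its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # no instance changed cluster
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


labels, centers = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

An empty cluster keeps its previous center in this sketch; production implementations usually re-seed it instead.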
22 Proposed Solution: K-Nearest Neighbor approach (overview). Given a query instance q(movieId, userId), normalize the data before processing. Find the distance between this user and every user who rated this movie. Of these users, select the K users that are nearest to the query instance as its neighborhood. Average the ratings that the users from this neighborhood gave to this particular movie. This is the predicted rating for the query instance.
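The overview above can be sketched end to end, including the conversion back to the user's own rating scale. The slides do not state which features the user-to-user distance is computed over; this sketch assumes Euclidean distance over the movies both users have rated. The function name, the users `u`, `v`, `w`, `x` and all numbers are invented.

```python
import math
from typing import Dict, Tuple


def knn_predict(z: Dict[str, Dict[str, float]], stats: Dict[str, Tuple[float, float]],
                user: str, movie: str, k: int) -> float:
    """Predict user's rating for movie from the k nearest users who rated it.

    z maps user -> {movie: normalized rating}; stats maps user -> (mean, std).
    """
    candidates = []
    for other, ratings in z.items():
        if other == user or movie not in ratings:
            continue  # only users who rated the query movie can be neighbors
        shared = [m for m in ratings if m in z[user] and m != movie]
        if not shared:
            continue
        dist = math.sqrt(sum((z[user][m] - ratings[m]) ** 2 for m in shared))
        candidates.append((dist, ratings[movie]))
    mean, std = stats[user]
    if not candidates:
        return mean  # fallback assumption: no neighbors, use the user's mean
    nearest = sorted(candidates)[:k]
    z_hat = sum(r for _, r in nearest) / len(nearest)
    return mean + std * z_hat  # denormalize to the user's own scale


z = {
    "u": {"a": 1.0, "b": -1.0},
    "v": {"a": 1.0, "b": -1.0, "c": 0.5},
    "w": {"a": -1.0, "b": 1.0, "c": -0.5},
    "x": {"a": 0.9, "b": -1.0, "c": 1.5},
}
pred = knn_predict(z, {"u": (3.0, 1.0)}, "u", "c", k=2)
```

Here `v` and `x` are the two nearest neighbors, their normalized ratings average to 1.0, and the prediction on `u`'s scale is 3.0 + 1.0 × 1.0.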
23 Proposed Solution - Example (representative data, not real): a user × movie rating matrix. Movies (columns): Matrix, Star Wars, Dark Knight, Rocky, Sita Aur Gita, Star Trek, Cliffhanger, A.I., MI, X-Men. Users (rows): Jim, Sean, John, Sidd, Penny, Pete. Surviving cells (after Pete): 5 ? 4 4
24 Proposed Solution - Example: Calculate the mean rating and standard deviation vectors. Columns: mean rating, standard deviation. Rows: Jim, Sean, John, Sidd, Penny, Pete.
25 Proposed Solution - Example: Normalized data. Movies (columns): Matrix, Star Wars, Dark Knight, Rocky, Sita Aur Gita, Star Trek, Cliffhanger, A.I., MI, X-Men. Users (rows): Jim, Sean, John, Sidd, Penny, Pete. Surviving cells (after Pete): 1.15 ?
26 Proposed Solution - Example: So now we have a query instance q(Pete, Sita Aur Gita), i.e. we wish to evaluate how much Pete will like the movie Sita Aur Gita on a scale of 1-5. To do this we need to identify Pete's two neighbors who rated this movie (the 2-NN case). The users who rated the movie Sita Aur Gita (candidate_users) are Jim, Sidd, and Penny.
27 Proposed Solution - Example: The users with their distances, and the 2 neighbors in the neighborhood. Columns: User, Distance. Rows: Jim, Sidd, Penny. The nearest neighbors are Jim and Sidd.
28 Proposed Solution - Example: The average of the ratings given by Jim and Sidd to the movie Sita Aur Gita is our initial prediction. So is our prediction correct? Not yet. This prediction is in normalized form. We need to bring it back to Pete's rating scale. How? Multiply it by the standard deviation of Pete's ratings and add Pete's mean rating to this product: predicted rating = (normalized prediction × Pete's standard deviation) + Pete's mean rating.
29 Proposed Solution - CKNN: Clustering-based Nearest Neighbor approach. Obtain the genres of every movie from external sources (IMDb in our case). Create for every user a vector with one cell per genre. In each cell we count the number of movies of that genre that the user has rated. (We have one such vector for each user.) Cluster the users according to the genres of the movies they have rated. The cluster centers represent the collective opinion of the users in that cluster about the movies of that particular genre. We call them Super Users.
30 Proposed Solution - CKNN: For each super user we predict the rating of every movie of that genre as the average of the ratings of the cluster's users who rated the movie. When presented with a query point q(movieId, userId), we find all the genres of that movie. For each genre we calculate the distance of the user from the cluster centers for that genre. We select the nearest K cluster centers and average their ratings for the movie to predict the movie's rating for this genre. We then average the per-genre predicted ratings to get the predicted rating for q.
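Building the per-user genre vector can be sketched as a simple count. The genre list below is taken from the example slides; the movie-to-genre assignments, the movie names `m1` to `m3`, and the function name are invented for illustration.

```python
from typing import Dict, Iterable, List

# Genre order as listed on the example slide.
GENRES = ["Action", "Adventure", "Crime", "Drama", "Fantasy", "SciFi", "Sport", "Thriller"]


def genre_vector(rated_movies: Iterable[str], movie_genres: Dict[str, List[str]],
                 genres: List[str] = GENRES) -> List[int]:
    """Count, per genre, how many of the user's rated movies belong to it."""
    counts = dict.fromkeys(genres, 0)
    for movie in rated_movies:
        for genre in movie_genres.get(movie, ()):
            if genre in counts:
                counts[genre] += 1
    return [counts[g] for g in genres]


# Hypothetical genre assignments (not the slide's data).
movie_genres = {
    "m1": ["Action", "SciFi"],
    "m2": ["Action", "Adventure"],
    "m3": ["Drama"],
}
vec = genre_vector(["m1", "m2"], movie_genres)
```

A movie with several genres contributes to several cells, so a vector's entries can sum to more than the number of movies the user rated.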
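The CKNN query step can be sketched under some stated assumptions: cluster centers live in the same genre-count space as the user vector, distance is Euclidean, and a genre is skipped when the user has rated no movie of that genre (as in the worked example on the following slides). The data layout (`clusters` as a dict of lists), the function name, and all numbers are invented; the slides do not show the actual implementation.

```python
import math
from typing import Dict, List, Optional


def cknn_predict(user_counts: Dict[str, int], genres: List[str], movie: str,
                 movie_genres: List[str], clusters: Dict[str, List[dict]],
                 k: int = 1) -> Optional[float]:
    """Average, over the movie's genres, the ratings of the k nearest super users."""
    user_vec = [user_counts.get(g, 0) for g in genres]
    per_genre = []
    for genre in movie_genres:
        if user_counts.get(genre, 0) == 0:
            continue  # user rated nothing in this genre: not applicable
        scored = sorted(
            (math.dist(user_vec, c["center"]), c["ratings"][movie])
            for c in clusters.get(genre, [])
            if movie in c["ratings"]
        )
        nearest = scored[:k]
        if nearest:
            per_genre.append(sum(r for _, r in nearest) / len(nearest))
    return sum(per_genre) / len(per_genre) if per_genre else None


genres = ["Adventure", "Drama"]
clusters = {
    "Adventure": [
        {"center": [1.0, 0.0], "ratings": {"m": 4.0}},
        {"center": [3.0, 1.0], "ratings": {"m": 2.0}},
    ],
    "Drama": [{"center": [0.0, 5.0], "ratings": {"m": 5.0}}],
}
pred = cknn_predict({"Adventure": 3, "Drama": 0}, genres, "m",
                    ["Adventure", "Drama"], clusters, k=1)
```

Because each query is compared with a handful of cluster centers rather than with every user who rated the movie, the per-query cost is small, which is consistent with the speed difference reported in the conclusions.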
31 Proposed Solution - Example (CKNN): We use the data from our previous example (recap). Movies (columns): Matrix, Star Wars, Dark Knight, Rocky, Sita Aur Gita, Star Trek, Cliffhanger, A.I., MI, X-Men. Users (rows): Jim, Sean, John, Sidd, Penny, Pete. Surviving cells (after Pete): 5 ? 4 4
32 Proposed Solution - Example (CKNN): We find the genres of every movie. Genres (columns): Action, Adventure, Crime, Drama, Fantasy, SciFi, Sport, Thriller. Movies (rows): Matrix, Star Wars, Dark Knight, Rocky, Sita Aur Gita, Star Trek, Cliffhanger, A.I., MI, X-Men. Surviving cells: Rocky 1 1; Sita Aur Gita 1 1; Star Trek 1 1; X-Men 1 1 1
33 Proposed Solution - Example (CKNN): Convert the user × movie data to user × genre data. Genres (columns): Action, Adventure, Crime, Drama, Fantasy, SciFi, Sport, Thriller. Users (rows): Jim, Sean, John, Sidd, Penny, Pete.
34 Proposed Solution - Example (CKNN): We cluster the users into two clusters. Genres (columns): Action, Adventure, Crime, Drama, Fantasy, SciFi, Sport, Thriller. Users (rows): Jim, Sean, John, Sidd, Penny, Pete.
35 Proposed Solution - Example (CKNN): The query point, as last time, is q(Pete, Sita Aur Gita). The per-genre clusters for the genres of Sita Aur Gita (Adventure, Drama) look like this: 1 2 Matrix Star Wars Sita Aur Gita 2 3 A.I. 5 Sita Aur Gita Star Trek 3 4 Cliffhanger A.I. 5 MI 2 1
36 Proposed Solution - Example (CKNN): Distance of Pete from the cluster centers of Adventure (clusters 1 and 2). Distance of Pete from the cluster centers of Drama: not applicable, as Pete has not rated any movie from that genre. We find the one (K=1) nearest cluster for the Adventure genre. That is cluster two.
37 Proposed Solution - Example (CKNN): Hence, the rating for the query point q(Pete, Sita Aur Gita) is calculated by taking the rating of cluster two of the Adventure genre. Our prediction for this movie is 2. What if Pete had rated one of the movies from the Drama genre? We would predict the rating for the Drama genre for Pete as well, then average the predicted ratings for the two genres to get the final rating.
39 Experiments - Setup: Dataset used: the Netflix Prize dataset. Experiments were performed on 1121 randomly selected movies covering users. These data instances are chosen from the probe file of the Netflix dataset. We have the ratings for these instances in the training data. These instances are treated as the hold-out set in the experiments.
40 Experiments - Setup: We normalize the data for the KNN method. The predictions made are converted back to denormalized form. We test the same set of (movie, user) pairs on both methods: standard K-Nearest Neighbor and clustered K-Nearest Neighbor.
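41 Experiments - Setup: This is a regression problem, so we want to know how far off we are from the expected value. Hence, the test metrics used are: Root Mean Square Error (RMSE): $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(p_i - r_i)^2}$; Absolute Average Error (AAE): $\frac{1}{n}\sum_{i=1}^{n}|p_i - r_i|$, where $p_i$ is the predicted rating and $r_i$ the actual rating of test instance $i$; and time taken.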
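The two error metrics named above can be computed with the standard definitions, sketched here. The function names and the three sample predictions are invented for illustration.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean square error: penalizes large misses more heavily."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def aae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Absolute average error: the mean size of a miss."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# One prediction is off by 2, the other two are exact.
r = rmse([3, 4, 5], [3, 2, 5])
a = aae([3, 4, 5], [3, 2, 5])
```

RMSE is never smaller than AAE on the same predictions, and the gap grows when errors are concentrated in a few large misses, as in this sample.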
42 Experiments - Implementation (KNN): Implemented in C / C++. Classes were converted to structures. It is difficult to manage the massive dataset in memory. The size of the program makes it difficult to run in C++. Comparison with every user needs a lot of fine-tuning of the code to achieve reasonable performance, which is KNN's inherent problem. Ease of implementation vs. speed is an important trade-off. Using maps and vectors only adds storage overhead, and any speed gained is negated by it.
43 Experiments - Implementation (CKNN): Implemented using Perl, Matlab, Python, and MySQL. Perl's hashes of hashes came to the rescue. Ease of token and string processing was most helpful. Complex logic is easy to express in Perl (regexes help). Python interfaces with IMDb (IMDbPY), and MySQL holds a local copy of the IMDb database. Matlab does the clustering (K-Means). Fine-tuning of the algorithm and ample available memory offset the slow, interpreted nature of these languages.
44 Experiments - Results on the described dataset. Table columns: Method, Absolute Average Error, Root Mean Square Error, Time (minutes). Rows: KNN *, CKNN, Netflix (leaderboard topper) NA NA, Netflix current system NA NA.
45 Experiments - Results. [Figures: RMSE comparison for KNN, CKNN, Netflix (current topper), and Netflix (current system); comparison of RMSE and absolute average error; time taken in minutes for KNN and CKNN]
46 Experiments - Results: Distribution of the absolute average error for the KNN and CKNN methods. [Figures: number of movies by error for the standard KNN method; number of movies by error for the CKNN method]
48 Related Work: Methods already applied to this problem include matrix factorization methods; regularized Singular Value Decomposition (SVD) [Paterek, 2007][Webb, 2007]; biases with regularized SVD [Paterek, 2007]; Probabilistic Latent Semantic Analysis (pLSA) [Hofmann, 2004]; nearest neighbor methods [Bell and Koren, 2007]; Alternating Least Squares [Bell and Koren, 2007]; and post-processing of SVD features [Paterek, 2007].
50 Future Work: KNN method: different values of K could be tried; distributed processing of this problem; distance-weighting the contributions from neighbors. CKNN: trying different numbers of clusters; the dates provided with the ratings could be used in clustering along with genre; more information from IMDb or other sources might be included; clustering the movies and then predicting the ratings for users is also possible.
52 Conclusions: We presented the results of two methods for solving the Netflix Prize problem, including a novel clustering-based method. The first method, a standard K-Nearest Neighbor method, achieves a lower RMSE value but is very slow in prediction, since it requires comparison with every user who rated the movie. The second method clusters the users based on the genres of the movies they rated and creates super users from these clusters.
53 Conclusions: The standard KNN method performs slightly better than the clustering-based method on the Root Mean Square Error metric but is extremely slow. Our clustering-based method has a higher Root Mean Square Error than the standard KNN method but is extremely fast and practical for large-scale implementations. It also shows promise of being accurate for many predictions.
55 Atul S Kulkarni
More informationA P2P REcommender system based on Gossip Overlays (PREGO)
10 th IEEE INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY Bradford,UK, 29 June  1 July, 2010 Ranieri Baraglia, Patrizio Dazzi, Matteo Mordacchini ISTI,CNR, Pisa,Italy Laura Ricci University
More informationAnalyzing Outlier Detection Techniques with Hybrid Method
Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,
More informationRecommender Systems using Graph Theory
Recommender Systems using Graph Theory Vishal Venkatraman * School of Computing Science and Engineering vishal2010@vit.ac.in Swapnil Vijay School of Computing Science and Engineering swapnil2010@vit.ac.in
More informationClustering kmean clustering
Clustering kmean clustering Genome 373 Genomic Informatics Elhanan Borenstein The clustering problem: partition genes into distinct sets with high homogeneity and high separation Clustering (unsupervised)
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationCreating a Recommender System. An Elasticsearch & Apache Spark approach
Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused
More informationPredictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive
More informationChapter 4: NonParametric Techniques
Chapter 4: NonParametric Techniques Introduction Density Estimation Parzen Windows KnNearest Neighbor Density Estimation KNearest Neighbor (KNN) Decision Rule Supervised Learning How to fit a density
More informationJarek Szlichta
Jarek Szlichta http://data.science.uoit.ca/ Approximate terminology, though there is some overlap: Data(base) operations Executing specific operations or queries over data Data mining Looking for patterns
More informationUnit 8 Algebra 1. Name:
Unit 8 Algebra 1 Name: Concepts: Correlations Graphing Scatter Plots BestFitting Liens [calculator key strokes] 4.5 Correlation and BestFitting Lines Correlations We use to tell if there is a relationship
More informationData Preprocessing. Javier Béjar AMLT /2017 CS  MAI. (CS  MAI) Data Preprocessing AMLT / / 71 BY: $\
Data Preprocessing S  MAI AMLT  2016/2017 (S  MAI) Data Preprocessing AMLT  2016/2017 1 / 71 Outline 1 Introduction Data Representation 2 Data Preprocessing Outliers Missing Values Normalization Discretization
More informationHow to use FSBforecast Excel add in for regression analysis
How to use FSBforecast Excel add in for regression analysis FSBforecast is an Excel add in for data analysis and regression that was developed here at the Fuqua School of Business over the last 3 years
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca DolocMihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.
More informationClustering Results. Result List Example. Clustering Results. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to
More informationAllstate Insurance Claims Severity: A Machine Learning Approach
Allstate Insurance Claims Severity: A Machine Learning Approach Rajeeva Gaur SUNet ID: rajeevag Jeff Pickelman SUNet ID: pattern Hongyi Wang SUNet ID: hongyiw I. INTRODUCTION The insurance industry has
More informationTopic 7 Machine learning
CSE 103: Probability and statistics Winter 2010 Topic 7 Machine learning 7.1 Nearest neighbor classification 7.1.1 Digit recognition Countless pieces of mail pass through the postal service daily. A key
More informationCPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017
CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start
More informationLatent Semantic Indexing
Latent Semantic Indexing Thanks to Ian Soboroff Information Retrieval 1 Issues: Vector Space Model Assumes terms are independent Some terms are likely to appear together synonyms, related words spelling
More informationADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN
More informationSemiAutomatic Transcription Tool for Ancient Manuscripts
The Venice Atlas A Digital Humanities atlas project by DH101 EPFL Students SemiAutomatic Transcription Tool for Ancient Manuscripts In this article, we investigate various techniques from the fields of
More informationSlides based on those in:
Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 2 y 0.8 ½+0.2 ⅓ M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N]
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationMethods for Intelligent Systems
Methods for Intelligent Systems Lecture Notes on Clustering (II) Davide Eynard eynard@elet.polimi.it Department of Electronics and Information Politecnico di Milano Davide Eynard  Lecture Notes on Clustering
More informationExploratory data analysis for microarrays
Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D66123 Saarbrücken Germany NGFN  Courses in Practical DNA
More informationMATH36032 Problem Solving by Computer. Data Science
MATH36032 Problem Solving by Computer Data Science NO. of jobs on jobsite 1 10000 NO. of Jobs 8000 6000 4000 2000 MATLAB Data Data Science 0 Jan 2016 Jul 2016 Jan 2017 1 http://www.jobsite.co.uk/ What
More informationBeing Prepared In A Sparse World: The Case of KNN Graph Construction. Antoine Boutet DRIM LIRIS, Lyon
Being Prepared In A Sparse World: The Case of KNN Graph Construction Antoine Boutet DRIM LIRIS, Lyon Coauthors Joint work with François Taiani Nupur Mittal AnneMarie Kermarrec Published at ICDE 2016
More informationCHAPTER 2 DESCRIPTIVE STATISTICS
CHAPTER 2 DESCRIPTIVE STATISTICS 1. StemandLeaf Graphs, Line Graphs, and Bar Graphs The distribution of data is how the data is spread or distributed over the range of the data values. This is one of
More informationApplication of Dimensionality Reduction in Recommender System  A Case Study
Application of Dimensionality Reduction in Recommender System  A Case Study Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John T. Riedl Department of Computer Science and Engineering / Army HPC
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Computer Science 591Y Department of Computer Science University of Massachusetts Amherst February 3, 2005 Topics Tasks (Definition, example, and notes) Classification
More informationProf. David Yarowsky
DATABASES (600315 and 600415) Prof David Yarowsky Department of Computer Science Johns Hopkins University yarowsky@gmailcom August 28, 2014 600315/415  DATABASES Instructor: Prof David Yarowsky TAs: Hackerman
More information