Distributed Itembased Collaborative Filtering with Apache Mahout. Sebastian Schelter twitter.com/sscdotopen. 7.

Similar documents
Machine Learning using MapReduce

Recommender Systems. Techniques of AI

Recommendation Algorithms: Collaborative Filtering. CSE 6111 Presentation Advanced Algorithms Fall Presented by: Farzana Yasmeen

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Recommender System. What is it? How to build it? Challenges. R package: recommenderlab

COMP6237 Data Mining Making Recommendations. Jonathon Hare

EPL451: Data Mining on the Web Lab 6

Jeff Howbert Introduction to Machine Learning Winter

System For Product Recommendation In E-Commerce Applications

Introduction to Data Mining

Data Mining Lecture 2: Recommender Systems

Recommender Systems - Introduction. Data Mining Lecture 2: Recommender Systems

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Introduction to Data Mining and Data Analytics

Example : Write a Hadoop MapReduce program for Movie Recommendation System.

Collective Intelligence in Action

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Part 11: Collaborative Filtering. Francesco Ricci

AMAZON.COM RECOMMENDATIONS ITEM-TO-ITEM COLLABORATIVE FILTERING PAPER BY GREG LINDEN, BRENT SMITH, AND JEREMY YORK

CS290N Summary Tao Yang

Mining Web Data. Lijun Zhang

Multi-domain Predictive AI. Correlated Cross-Occurrence with Apache Mahout and GPUs

Knowledge Discovery and Data Mining 1 (VO) ( )

Mahout in Action MANNING ROBIN ANIL SEAN OWEN TED DUNNING ELLEN FRIEDMAN. Shelter Island

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Building a Recommendation System for EverQuest Landmark s Marketplace

Data Mining: Decision Trees

Data Mining: Scoring (Linear Regression)

Collaborative Filtering using Euclidean Distance in Recommendation Engine

BBS654 Data Mining. Pinar Duygulu

Web Personalization & Recommender Systems

CSE 454 Final Report TasteCliq

Comparative performance of opensource recommender systems

arxiv: v4 [cs.ir] 28 Jul 2016

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science. September 21, 2017

SUGGEST. Top-N Recommendation Engine. Version 1.0. George Karypis

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

Web Personalization & Recommender Systems

Property1 Property2. by Elvir Sabic. Recommender Systems Seminar Prof. Dr. Ulf Brefeld TU Darmstadt, WS 2013/14

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

CS249: ADVANCED DATA MINING

RECOMMENDATION SYSTEM USING COLLABORATIVE FILTERING

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b

Mining Web Data. Lijun Zhang

CS 124/LINGUIST 180 From Languages to Information

CS 124/LINGUIST 180 From Languages to Information

Chee Kiam. to sieve through. and the next one. relevant. The advances in Big. (NLB) of Singapore.

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Text Classification Using Mahout

Parallel learning of content recommendations using map- reduce

Data Processing on Large Clusters. By: Stephen Cardina

MapReduce Design Patterns

Machine Learning in Action

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Music Recommendation with Implicit Feedback and Side Information

Part I: Data Mining Foundations

Hybrid Recommendation System Using Clustering and Collaborative Filtering

Recommender Systems. Collaborative Filtering & Content-Based Recommending

Collaborative Filtering and Recommender Systems. Definitions. .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Information Retrieval: Retrieval Models

Neighborhood-Based Collaborative Filtering

Introduction to Hadoop and MapReduce

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Epilog: Further Topics

We extend SVM s in order to support multi-class classification problems. Consider the training dataset

CS 124/LINGUIST 180 From Languages to Information

A Time-based Recommender System using Implicit Feedback

Extension Study on Item-Based P-Tree Collaborative Filtering Algorithm for Netflix Prize

Using Data Mining to Determine User-Specific Movie Ratings

Specialist ICT Learning

An Open Source Discussion Group Recommendation System. A Project Presented to. The Faculty of the Department of Computer Science

Part 11: Collaborative Filtering. Francesco Ricci

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b

Implicit Documentation

Databases 2 (VU) ( / )

Study and Analysis of Recommendation Systems for Location Based Social Network (LBSN)

Big Data with Hadoop Ecosystem

Recommender Systems - Content, Collaborative, Hybrid

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

ECLT 5810 Clustering

Data Mining Techniques

CS224W Project: Recommendation System Models in Product Rating Predictions

Predicting User Ratings Using Status Models on Amazon.com

Mining of Massive Datasets

Mining Distributed Frequent Itemset with Hadoop

[3-5] Consider a Combiner that tracks local counts of followers and emits only the local top10 users with their number of followers.

TIM 50 - Business Information Systems

ECS289: Scalable Machine Learning

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Natural Language Processing

MURDOCH RESEARCH REPOSITORY

Recommendation System for Sports Videos

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

Seminar Collaborative Filtering. KDD Cup. Ziawasch Abedjan, Arvid Heise, Felix Naumann

Introduction to Big-Data

Transcription:

Distributed Itembased Collaborative Filtering with Apache Mahout Sebastian Schelter ssc@apache.org twitter.com/sscdotopen 7. October 2010

Overview 1. What is Apache Mahout? 2. Introduction to Collaborative Filtering 3. Itembased Collaborative Filtering 4. Computing similar items with Map/Reduce 5. Implementations in Mahout 6. Further information Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 2

What is Apache Mahout? A scalable Machine Learning library scalable to reasonably large datasets (core algorithms implemented in Map/Reduce, runnable on Hadoop) scalable to support your business case (Apache License) scalable community Usecases Clustering (group items that are topically related) Classification (learn to assign categories to documents) Frequent Itemset Mining (find items that appear together) Recommendation Mining (find items a user might like) Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 3

Recommendation Mining = Help users find items they might like Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 4

Users, Items, Preferences Terminology users interact with items (books, videos, news, other users,...) preferences of each user towards a small subset of the items known (numeric or boolean) Algorithmic problems Prediction: Estimate the preference of a user towards an item he/she does not know Use Prediction for Top-N-recommendation: Find the N items a user might like best Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 5

Explicit and Implicit Ratings Where do the preferences come from? Explicit Ratings users explictly express their preferences (e.g. ratings with stars) willingness of the users required Implicit Ratings interactions with items are interpreted as expressions of preference (e.g. purchasing a book, reading a news article) interactions must be detectable Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 6

Collaborative Filtering How does it work? the past predicts the future: all predictions are derived from historical data (the preferences you already know) completely content agnostic very popular (e.g. used by Amazon, Google News) Mathematically user-item-matrix is created from the preference data task is to predict missing entries by finding patterns in the known entries Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 7

A sample user-item-matrix The Matrix Alien Inception Alice 5 1 4 Bob? 2 5 Peter 4 3 2 Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 8

Itembased Collaborative Filtering Algorithm neighbourhood-based approach works by finding similarly rated items in the user-item-matrix estimates a user's preference towards an item by looking at his/her preferences towards similar items Highly scalable item similarities tend to be relatively static, can be precomputed offline periodically less items than users in most scenarios looking at a small number of similar items is sufficient Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 9

Example Similarity of The Matrix and Inception rating vector of The Matrix : (5,-,4) rating vector of Inception : (4,5,2) isolate all cooccurred ratings (all cases where a user rated both items) pick a similarity measure to compute a similarity value between -1 and 1 e.g. Pearson-Correlation 5 4-5 4 2 corr i, j = u U R u, i R i R u, j R j u U R u,i R i u U R u, j R j =0.47 Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 10

Example Prediction: Estimate Bob's preference towards The Matrix look at all items that a) are similar to The Matrix b) have been rated by Bob => Alien, Inception estimate the unknown preference with a weighted sum P Bob, Matrix = s Matrix, Alien r Bob, Alien s Matrix, Inception r Bob, Inception s Matrix, Alien s Matrix, Inception =1.5 Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 11

Algorithm in Map/Reduce How can we compute the similarities efficiently with Map/Reduce? Key ideas we can ignore pairs of items without a cooccurring rating we need to see all cooccurring ratings for each pair of items in the end Inspired by an algorithm designed to compute the pairwise similarity of text documents 5 4-5 4 2 Mahout's implementation is more generalized to be usable with other similarity measures, see DistributedVectorSimilarity and RowSimilarityJob for more details Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 12

Algorithm in Map/Reduce - Pass 1 Map - make user the key (Alice,Matrix,5) (Alice,Alien,1) (Alice,Inception,4) (Bob,Alien,2) (Bob,Inception,5) (Peter,Matrix,4) (Peter,Alien,3) (Peter,Inception,2) Alice (Matrix,5) Alice (Alien,1) Alice (Inception,4) Bob (Alien,2) Bob (Inception,2) Peter (Matrix,4) Peter (Alien,3) Peter (Inception,2) Reduce - create inverted index Alice (Matrix,5) Alice (Alien,1) Alice (Inception,4) Bob (Alien,2) Bob (Inception,5) Peter (Matrix,4) Peter (Alien,3) Peter (Inception,2) Alice (Matrix,5)(Alien,1)(Inception,4) Bob (Alien,2)(Inception,5) Peter (Matrix,4)(Alien,3)(Inception,2) Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 13

Algorithm in Map/Reduce - Pass 2 Map - emit all cooccurred ratings Alice (Matrix,5)(Alien,1) (Inception,4) Matrix,Alien (5,1) Matrix,Inception (5,4) Alien,Inception (1,4) Bob (Alien,2)(Inception,5) Alien,Inception (2,5) Peter (Matrix,4)(Alien,3) (Inception,2) Matrix,Alien (4,3) Matrix,Inception (4,2) Alien,Inception(3,2) Reduce - compute similarities Matrix,Alien (5,1) Matrix,Alien (4,3) Matrix,Inception (5,4) Matrix,Inception (4,2) Alien,Inception (1,4) Alien,Inception (2,5) Alien,Inception (3,2) Matrix,Alien ( 0.47) Matrix,Inception (0.47) Alien,Inception ( 0.63) Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 14

Implementations in Mahout ItemSimilarityJob computes all item similarities various configuration options: similarity measure to use (e.g. cosine, Pearson-Correlation, Tanimoto-Coefficient, your own implementation) maximum number of similar items per item maximum number of cooccurrences considered... Input: preference data as CSV file, each line represents a single preference in the form userid,itemid,value Output: pairs of itemids with their associated similarity value Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 15

Implementations in Mahout RecommenderJob Distributed Itembased Recommender various configuration options: similarity measure to use number of recommendations per user filter out some users or items... Input: the preference data as CSV file, each line contains a preference in the form userid,itemid,value Output: userids with associated recommended itemids and their scores Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 16

Further information Mahout's website, wiki and mailinglist http://mahout.apache.org user@mahout.apache.org Mahout in Action, available through Manning's Early Access Program http://manning.com/owen B. Sarwar et al: Itembased collaborative filtering recommendation algorithms, 2001 T. Elsayed et al: Pairwise document similarity in large collections with MapReduce, 2008 Sebastian Schelter: Distributed Itembased Collaborative Filtering with Apache Mahout 17