Content-based Recommender Systems

Similar documents
Preference Learning in Recommender Systems

Lecture 7: Relevance Feedback and Query Expansion

Search Engines. Information Retrieval in Practice

Classification. 1 o Semestre 2007/2008

WordNet-based User Profiles for Semantic Personalization

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Chapter 27 Introduction to Information Retrieval and Web Search

VECTOR SPACE CLASSIFICATION

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

- Content-based Recommendation -

Clustering. Bruno Martins. 1 st Semester 2012/2013

Document Clustering for Mediated Information Access The WebCluster Project

Reading group on Ontologies and NLP:

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 1)"

Mining Web Data. Lijun Zhang

Naïve Bayes for text classification

Classification and Clustering

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Computer Vision. Exercise Session 10 Image Categorization

Chapter 6: Information Retrieval and Web Search. An introduction

Keyword Extraction by KNN considering Similarity among Features

Unsupervised Learning

Improving Retrieval Experience Exploiting Semantic Representation of Documents

Evaluation Methods for Focused Crawling

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Text Categorization (I)

Recommender Systems (RSs)

Mining Web Data. Lijun Zhang

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

Generative and discriminative classification techniques

Chapter 3: Supervised Learning

Introduction to Information Retrieval

CS60092: Information Retrieval

Entity Linking. David Soares Batista. November 11, Disciplina de Recuperação de Informação, Instituto Superior Técnico

Generative and discriminative classification techniques

Machine Learning. Supervised Learning. Manfred Huber

Birkbeck (University of London)

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval and Web Search

Knowledge Discovery and Data Mining 1 (VO) ( )

Collaborative Filtering using Euclidean Distance in Recommendation Engine

Available online at ScienceDirect. Procedia Technology 17 (2014 )

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

A Taxonomy of Semi-Supervised Learning Algorithms

Information Retrieval

Based on Raymond J. Mooney s slides

Question Answering Systems

CS371R: Final Exam Dec. 18, 2017

Recap of the last lecture. CS276A Text Retrieval and Mining. Text Categorization Examples. Categorization/Classification. Text Classification

Building Search Applications

Developing Focused Crawlers for Genre Specific Search Engines

A hybrid method to categorize HTML documents

Semi-Supervised Clustering with Partial Background Information

Weight adjustment schemes for a centroid based classifier

Part I: Data Mining Foundations

Social Search Networks of People and Search Engines. CS6200 Information Retrieval

Machine Learning: Algorithms and Applications Mockup Examination

Modern Information Retrieval

Problem 1 (20 pt) Answer the following questions, and provide an explanation for each question.

Predicting Gene Function and Localization

Multimedia Information Systems

Information Retrieval

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Introduction to Text Mining. Hongning Wang

Multi-label classification using rule-based classifier systems

Conceptual, Impact-Based Publications Recommendations

Collective Intelligence in Action

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

BaggTaming Learning from Wild and Tame Data

Machine Learning in Biology

Combining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines

User Preference Modeling - A Survey Report. Hamza Hydri Syed, Periklis Andritsos. Technical Report # DIT

Chapter 9. Classification and Clustering

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents.

Multi-Stage Rocchio Classification for Large-scale Multilabeled

Representation of Documents and Infomation Retrieval

Machine Learning. Chao Lan

Modern Information Retrieval

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

ADVANCED CLASSIFICATION TECHNIQUES

Geographical Classification of Documents Using Evidence from Wikipedia

Mixture Models and the EM Algorithm

Weight Adjustment Schemes For a Centroid Based Classifier. Technical Report

Jeff Howbert Introduction to Machine Learning Winter

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Beyond Content-Based Retrieval:

Table Of Contents: xix Foreword to Second Edition

Problem 1: Complexity of Update Rules for Logistic Regression

Support Vector Machines + Classification for IR

Information Retrieval. (M&S Ch 15)

What Causes My Test Alarm? Automatic Cause Analysis for Test Alarms in System and Integration Testing

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Ensemble Learning. Another approach is to leverage the algorithms we have via ensemble methods

Micro-blogging Sentiment Analysis Using Bayesian Classification Methods

Instance and case-based reasoning

Transcription:

Recuperação de Informação Doutoramento em Engenharia Informática e Computadores Instituto Superior Técnico Universidade Técnica de Lisboa

Bibliography Pasquale Lops, Marco de Gemmis, Giovanni Semeraro: : State of the Art and Trends. Recommender Systems Handbook: 73-105 (2011) Springer

Outline 1 2 3 4 5

The large amounts of information available in the Internet.

The large amounts of information available in the Internet. The information in the Internet is dynamic and heterogeneous.

The large amounts of information available in the Internet. The information in the Internet is dynamic and heterogeneous. Personalizing the access to the available information is important.

The large amounts of information available in the Internet. The information in the Internet is dynamic and heterogeneous. Personalizing the access to the available information is important. Recommendation systems Research in recommendation emerged from the information retrieval research in the mid-90s.

Recommendation paradigms Collaborative filters Collaborative filters identify users with similar preferences and use this information to generate recommendations. Content-based filters Content-based filters try to recommend items similar to those a given user has liked in the past.

Collaborative filters User-based collaborative filtering Item-based collaborative filtering b1 b2 b3 b2 b6 b4 ui uj b1 b2 b3 recommendation b4 b1 b8 Items preferred together b3 b7 b5 Recommendations b1 b2 b3 b1 b5 uk ti ti+1 u User items

Content analyzer Profile learner User Feedback 1 2 Content analyzer Profile learner Probabilistic models and Naïve Bayes Relevance feedback and Rocchio s algorithm User Feedback 3 4

Content analyzer Profile learner User Feedback

Content-analyzer Outline Content analyzer Profile learner User Feedback

Content-analyzer Outline Content analyzer Profile learner User Feedback The content analyzer generates structured representations of the original items, e.g., documents, web pages news articles, product descriptions, etc. Item representation have two types: the keyword vector space model representations that include semantic knowledge

Content analyzer Profile learner User Feedback Item representation: Keyword-based vector space model Documents are represented by a n-dimensional vector, where each dimension corresponds to a term. Let D = {d 1, d 2,..., d N } the document set T = {t 1, t 2,..., t n } the term set document d j is represented by d j = {w 1,j, w 2,j,..., w n,j } where, w k,j is the weight of term k in document j typically, the weight is f k,j max z f z,j TF IDF (t k, d j ) = log N n k typically, similarity is measured using the cosine sim(d i, d j ) = Σ kw k,i w q q k,j Σw 2 k,i Σw 2 k,j

Content analyzer Profile learner User Feedback Document representation using semantic analysis Documents are represented as, Wordnet synset networks that are matched to user profile, also a synset network Wordnet synset vector space model using the concept of bag-of-synsets (BOS) high-dimensional space of concepts derived from Wikipedia

Profile learner Outline Content analyzer Profile learner User Feedback

Profile learner Outline Content analyzer Profile learner User Feedback Collects user documents, from Represented items repository, and user feedback. Then tries to generalize the collected data in order to build a profile. Methods for learning user profile: probabilistic methods, e.g., Naïve Bayes relevance feedback, Rocchio s algorithm decision trees nearest neighbor clustering

Probabilistic models Content analyzer Profile learner User Feedback Generates a probabilistic model based on previously observed data. Naïve Bayes observes the documents preferred by the user and calculates the parameters of the observed data, typically using multivariate Bernoulli event model multinomial event model estimates the a posteriori probability, P(c d) using P(c d) = P(c)P(c d) P(d) empirical results have shown that the multinomial event model outperforms the mutivariate Bernoulli

Relevance feedback Content analyzer Profile learner User Feedback Relevance feedback is a technique that consists of users feeding back into the system on the relevance of retrieved documents with respect to their information needs. Rocchio s algorithm the algorithm calculates the class vector c i =< ω 1,i, ω 2,i,..., ω T,i >, where ω k,i is the weight o term k in class i T is the vocabulary weights are calculated using ω k,j ω k,i = β Σ {dj POS i } POS i γ Σ {d j NEG i } ω k,j NEG i

Content analyzer Profile learner User Feedback

Content analyzer Profile learner User Feedback The filtering component matches user profile against document representation to generate a recommendation list of items for the active user. To find new documents, the filtering component searches for the documents that maximize P(c)P(c d d = argmax j ) dj P(d j ) when using Naïve Bayes classification compares documents that are similar to c i =< ω 1,i, ω 2,i,..., ω T,i > when using Rocchio s algorithm and generates the list of recommendations.

User feedback Outline Content analyzer Profile learner User Feedback The active user looks at the recommendation list and gives feedback to recommendation system. Implicit feedback preferences are collected without user explicit intervention user activities monitorized and analyzed documents bought/downloaded documents visualized documents bookmarked Explicit feedback user explicit feeds the systems with ratings three main approaches binary: like/deslike numeric ratings: 0-5 or totally dislike, moderate dislike, neutral, moderate like, totally like text comments

Advantages Drawbacks 1 2 3 Advantages Drawbacks 4 5

Advantages Outline Advantages Drawbacks Advantages User independence Content-based recommenders do not use ratings from other users. Transparency Explanations can be provided based on features. New Item New items do not need ratings to be recommended.

Drawbacks Outline Advantages Drawbacks Drawbacks Limited content analysis If documents are extremely short, e.g., jokes, content many not be enough to classify items. New user problem In order to get accurate recommendations, the user must have a enough ratings. The user is going to be recommended documents similar to those already rated by the user.

Novelty vs Serendipity Beyond over-specialization 1 2 3 4 Novelty vs Serendipity Beyond over-specialization 5

Novelty vs Serendipity Novelty vs Serendipity Beyond over-specialization Novelty Novelty occurs when the system suggests to the user an unknown item that he might have autonomously discovered. Serendipity Serendipitous recommendation helps the user to find a surprisingly interesting item that the user might not have otherwise discovered

Beyond over-specialization Novelty vs Serendipity Beyond over-specialization Solutions to surpass over-specialization of some randomness with randomness measures Genetic algorithms Elimination of items to similar

A high-level architecture content-based recommendation was presented. Content-based recommenders are not effectively used in real case scenarios due to the over-specialization problem. Further research in generating serendipitous recommendation is needed. Content-based recommenders can benefit from further research in NLP, e.g., the use of semantic analysis.