Recommendation with Differential Context Weighting

Yong Zheng, Robin Burke, Bamshad Mobasher
Center for Web Intelligence, DePaul University, Chicago, IL USA
Conference on UMAP, June 12, 2013

Overview
Introduction (RS and Context-aware RS)
Sparsity of Contexts and Relevant Solutions
Differential Context Relaxation & Weighting
Experimental Results
Conclusion and Future Work

Introduction Recommender Systems Context-aware Recommender Systems

Recommender Systems (RS) Information Overload Recommendations

Context-aware RS (CARS)
Traditional RS: Users × Items → Ratings
Context-aware RS: Users × Items × Contexts → Ratings
Examples of contexts in different domains:
Food: time (lunch, dinner), occasion (business lunch, family dinner)
Movie: time (weekend, weekday), location (home, cinema), etc.
Music: time (morning, evening), activity (study, sports, party), etc.
Book: a book as a gift for kids or for one's mother, etc.
Recommendation cannot stand alone; it has to take contexts into account.

Research Problems Sparsity of Contexts Relevant Solutions

Sparsity of Contexts
Assumption of context-aware RS: it is better to use preferences expressed in the same contexts when making predictions.
But what counts as the same contexts? What about multiple context dimensions and sparsity?
An example in the movie domain:
User  Movie    Time     Location  Companion   Rating
U1    Titanic  Weekend  Home      Girlfriend  4
U2    Titanic  Weekday  Home      Girlfriend  5
U3    Titanic  Weekday  Cinema    Sister      4
U1    Titanic  Weekday  Home      Sister      ?
Are there any rating profiles in the contexts <Weekday, Home, Sister>?

Relevant Solutions
User  Movie    Time     Location  Companion   Rating
U1    Titanic  Weekend  Home      Girlfriend  4
U2    Titanic  Weekday  Home      Girlfriend  5
U3    Titanic  Weekday  Cinema    Sister      4
U1    Titanic  Weekday  Home      Sister      ?
Context matching: are there ratings in exactly the same contexts <Weekday, Home, Sister>?
1. Context Selection: use the influential dimensions only.
2. Context Relaxation: use a relaxed set of dimensions, e.g. time only.
3. Context Weighting: use all dimensions, but measure how similar the contexts are (to be continued later).
Differences between context selection and context relaxation:
Context selection is conducted by surveys or statistics;
Context relaxation directly optimizes the predictions.
Finding the optimal context relaxation/weighting is a learning process!

DCR and DCW Differential Context Relaxation (DCR) Differential Context Weighting (DCW) Particle Swarm Intelligence as Optimizer

Differential Context Relaxation
Differential Context Relaxation (DCR) is our first attempt to alleviate the sparsity of contexts, and Differential Context Weighting (DCW) is a finer-grained improvement over DCR.
There are two notions in DCR:
Differential part: algorithm decomposition
Separate one algorithm into different functional components;
Apply appropriate context constraints to each component;
Maximize the global contextual effects together.
Relaxation part: context relaxation
We use a set of relaxed dimensions instead of all of them.
References
Y. Zheng, R. Burke, B. Mobasher. "Differential Context Relaxation for Context-aware Travel Recommendation". In EC-WEB, 2012
Y. Zheng, R. Burke, B. Mobasher. "Optimal Feature Selection for Context-Aware Recommendation using Differential Relaxation". In RecSys Workshop on CARS, 2012

DCR Algorithm Decomposition
Take User-based Collaborative Filtering (UBCF) for example.
User  Pirates of the Caribbean 4  Kung Fu Panda 2  Harry Potter 6  Harry Potter 7
U1    4                           4                2               2
U2    3                           4                2               1
U3    2                           2                4               4
U4    4                           4                1               ?
Standard process in UBCF (Top-K UserKNN, K=1 for example):
1) Find neighbors based on user-user similarity
2) Aggregate the neighbors' contributions
3) Make the final prediction
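The three steps can be sketched on the rating matrix from this slide. This is a minimal illustration, not the authors' implementation: it assumes Pearson correlation over co-rated items as the user-user similarity and a mean-centered weighted average as the aggregation, neither of which the slide specifies. The names `RATINGS`, `pearson` and `predict` are hypothetical.

```python
from math import sqrt

# Rating matrix from the slide; None marks the rating to predict.
# Columns: Pirates of the Caribbean 4, Kung Fu Panda 2, Harry Potter 6, Harry Potter 7
RATINGS = {
    "U1": [4, 4, 2, 2],
    "U2": [3, 4, 2, 1],
    "U3": [2, 2, 4, 4],
    "U4": [4, 4, 1, None],
}


def mean(values):
    return sum(values) / len(values)


def user_mean(user):
    """Average of all ratings the user has given (the user baseline)."""
    return mean([r for r in RATINGS[user] if r is not None])


def pearson(a, b):
    """Pearson correlation over the items both users rated (assumed similarity)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 2:
        return 0.0
    ma = mean([x for x, _ in pairs])
    mb = mean([y for _, y in pairs])
    num = sum((x - ma) * (y - mb) for x, y in pairs)
    den = sqrt(sum((x - ma) ** 2 for x, _ in pairs)) * sqrt(
        sum((y - mb) ** 2 for _, y in pairs)
    )
    return num / den if den else 0.0


def predict(user, item_idx, k=1):
    # 1) Neighbor selection: users who rated the item, ranked by similarity.
    sims = {
        u: pearson(RATINGS[user], ratings)
        for u, ratings in RATINGS.items()
        if u != user and ratings[item_idx] is not None
    }
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    # 2) Neighbor contribution: similarity-weighted deviation from each neighbor's mean.
    num = sum(sims[u] * (RATINGS[u][item_idx] - user_mean(u)) for u in neighbors)
    den = sum(abs(sims[u]) for u in neighbors)
    # 3) Final prediction: the target user's baseline plus the aggregated deviation.
    return user_mean(user) + (num / den if den else 0.0), neighbors


prediction, neighbors = predict("U4", item_idx=3, k=1)
```

Under these assumptions U1 is the single nearest neighbor of U4, and because U1 rated Harry Potter 7 one point below its own average, the prediction for U4 lands one point below U4's average as well.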

DCR Algorithm Decomposition
Take User-based Collaborative Filtering (UBCF) for example. Its prediction function can be decomposed into four components:
1. Neighbor selection
2. Neighbor contribution
3. User baseline
4. User similarity
All components contribute to the final prediction. We assume that appropriate contextual constraints can leverage the contextual effect in each algorithm component, e.g. use only neighbors who rated in the same contexts.
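For orientation, the four components can be located in the textbook UBCF prediction rule (Resnick's formula). The formula shown on the original slide did not survive extraction, so the notation below is an assumption: P_{a,i} is the predicted rating of user a for item i, and N is the neighbor set.

```latex
P_{a,i} \;=\;
\underbrace{\bar{r}_a}_{\text{3. user baseline}}
\;+\;
\frac{\sum_{u \in N}
      \bigl(\underbrace{r_{u,i}}_{\text{2. neighbor contribution}}
            - \underbrace{\bar{r}_u}_{\text{3. user baseline}}\bigr)
      \cdot \underbrace{\mathrm{sim}(a,u)}_{\text{4. user similarity}}}
     {\sum_{u \in N} \mathrm{sim}(a,u)},
\qquad
\underbrace{N}_{\text{1. neighbor selection}}
```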

DCR Context Relaxation
User  Movie    Time     Location  Companion   Rating
U1    Titanic  Weekend  Home      Girlfriend  4
U2    Titanic  Weekday  Home      Girlfriend  5
U3    Titanic  Weekday  Cinema    Sister      4
U1    Titanic  Weekday  Home      Sister      ?
Notion of context relaxation, for the target contexts <Weekday, Home, Sister>:
Use {Time, Location, Companion}: 0 records matched!
Use {Time, Location}: 1 record matched!
Use {Time}: 2 records matched!
In DCR, we choose an appropriate context relaxation for each component.
Balance: enough matched ratings for the best performance, with the least noise.
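The matching counts above can be reproduced with a few lines of code. This is a small sketch that encodes a relaxation as a binary mask over (Time, Location, Companion); the names `RECORDS`, `TARGET` and `matched_records` are hypothetical.

```python
# Known ratings from the table: (user, (time, location, companion), rating)
RECORDS = [
    ("U1", ("Weekend", "Home", "Girlfriend"), 4),
    ("U2", ("Weekday", "Home", "Girlfriend"), 5),
    ("U3", ("Weekday", "Cinema", "Sister"), 4),
]

# Contexts of the rating we want to predict.
TARGET = ("Weekday", "Home", "Sister")


def matches(context, target, mask):
    """True if context equals target on every dimension selected by the mask."""
    return all(c == t for c, t, m in zip(context, target, mask) if m)


def matched_records(mask):
    """Records whose contexts match the target under the given relaxation."""
    return [rec for rec in RECORDS if matches(rec[1], TARGET, mask)]


for mask in [(1, 1, 1), (1, 1, 0), (1, 0, 0)]:
    print(mask, len(matched_records(mask)))
```

Dropping a dimension from the mask can only keep or grow the matched set, which is the trade-off the slide describes: more usable ratings, but a looser notion of "same context".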

DCR Context Relaxation
Components: 1. Neighbor selection; 2. Neighbor contribution; 3. User baseline; 4. User similarity
c is the original contexts, e.g. <Weekday, Home, Sister>.
C1, C2, C3, C4 are the relaxed contexts, one for each component.
Each selection is modeled by a binary vector, e.g. <1, 0, 0> denotes that only the first context dimension is selected.
Take neighbor selection for example: originally, neighbors are selected from the users who rated the same item. DCR further filters those neighbors by the contextual constraint C1, e.g. with C1 = <1, 0, 0> and Time = Weekday, u must have rated i on weekdays.
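One plausible way to write the contextualized prediction is to attach each relaxed context to its component in the standard UBCF rule. The formula on the slide was lost in extraction, so the notation below is an assumption, not a quotation: a subscript C_k means that the quantity is computed only from ratings whose contexts match c on the dimensions selected by C_k.

```latex
P_{a,i,c} \;=\; \bar{r}_{a,C_3}
\;+\;
\frac{\sum_{u \in N_{C_1}} \bigl(r_{u,i,C_2} - \bar{r}_{u,C_3}\bigr)\,
      \mathrm{sim}_{C_4}(a,u)}
     {\sum_{u \in N_{C_1}} \mathrm{sim}_{C_4}(a,u)}
```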

DCR Drawbacks
Components: 1. Neighbor selection; 2. Neighbor contribution; 3. User baseline; 4. User similarity
1. Context relaxation is still strict, especially when the data is sparse.
2. Components are dependent. For example, neighbor contribution depends on neighbor selection: if neighbors are selected by C1: Location = Cinema, it is not guaranteed that a neighbor has ratings under contexts C2: Time = Weekend.
A finer-grained solution is required: Differential Context Weighting.

Differential Context Weighting
User  Movie    Time     Location  Companion   Rating
U1    Titanic  Weekend  Home      Girlfriend  4
U2    Titanic  Weekday  Home      Girlfriend  5
U3    Titanic  Weekday  Cinema    Sister      4
U1    Titanic  Weekday  Home      Sister      ?
Goal: use all dimensions, but measure the similarity of contexts.
Assumption: the more similar two contexts are, the more useful the ratings given in them may be for predictions.
Similarity of contexts is measured by weighted Jaccard similarity.
c and d are two contexts (the two regions highlighted in red in the table on the slide).
σ is the weighting vector <w1, w2, w3> for the three dimensions.
Assume equal weights, w1 = w2 = w3 = 1. Then J(c, d, σ) = # of matched dimensions / # of all dimensions = 2/3.
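The worked value 2/3 can be checked with a short function. This sketch assumes that the weighted form generalizes the equal-weight case as "sum of the weights of the matched dimensions divided by the sum of all weights", and that the two compared contexts are the target <Weekday, Home, Sister> and U2's <Weekday, Home, Girlfriend>; the slide does not say which rows were highlighted. The function name is hypothetical.

```python
def weighted_jaccard(c, d, sigma):
    """Similarity of two contexts: weight of matched dimensions over total weight."""
    total = sum(sigma)
    if total == 0:
        return 0.0
    matched = sum(w for x, y, w in zip(c, d, sigma) if x == y)
    return matched / total


target = ("Weekday", "Home", "Sister")
u2_context = ("Weekday", "Home", "Girlfriend")

# Equal weights: two of the three dimensions match.
equal = weighted_jaccard(target, u2_context, (1, 1, 1))

# Hypothetical unequal weights: the unmatched Companion dimension counts more,
# so the same pair of contexts looks less similar.
skewed = weighted_jaccard(target, u2_context, (0.5, 0.5, 1.0))
```

With unequal weights the same pair of contexts can score higher or lower, which is what gives the optimizer something to learn.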

Differential Context Weighting
Components: 1. Neighbor selection; 2. Neighbor contribution; 3. User baseline; 4. User similarity
1. Differential part: the components are all the same as in DCR.
2. Context weighting part (for each individual component): σ is the weighting vector; ϵ is a threshold for the similarity of contexts, i.e. only records with similar enough (≥ ϵ) contexts are included.
3. In the calculations, the similarities of contexts serve as the weights, for example in component 2, neighbor contribution. The calculation is similar for the other components.
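As an illustration of point 3, a neighbor's contribution can be sketched as a similarity-weighted average of that neighbor's ratings for the item across contexts, keeping only contexts whose similarity to the target reaches ϵ. The exact formula on the slide was lost, so this is an assumed form; the neighbor's ratings below are hypothetical data, and the function names are invented for the example.

```python
def weighted_jaccard(c, d, sigma):
    """Weight of matched dimensions over total weight (assumed similarity)."""
    total = sum(sigma)
    if total == 0:
        return 0.0
    return sum(w for x, y, w in zip(c, d, sigma) if x == y) / total


def neighbor_contribution(records, target, sigma, epsilon):
    """Similarity-weighted average of one neighbor's ratings for one item.

    records: list of (context, rating) pairs given by the neighbor to the item.
    Returns None when no record is similar enough to the target contexts.
    """
    num = den = 0.0
    for context, rating in records:
        sim = weighted_jaccard(context, target, sigma)
        if sim >= epsilon and sim > 0:
            num += sim * rating
            den += sim
    return num / den if den else None


target = ("Weekday", "Home", "Sister")
# Hypothetical ratings by one neighbor for the same movie in three contexts.
records = [
    (("Weekday", "Home", "Girlfriend"), 5),  # similarity 2/3 with equal weights
    (("Weekday", "Cinema", "Sister"), 4),    # similarity 2/3
    (("Weekend", "Home", "Friend"), 3),      # similarity 1/3
]

strict = neighbor_contribution(records, target, (1, 1, 1), epsilon=0.5)
loose = neighbor_contribution(records, target, (1, 1, 1), epsilon=0.0)
```

With ϵ = 0.5 only the first two records count; lowering ϵ to 0 admits the third record as well, at a lower weight.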

Particle Swarm Optimization (PSO)
The remaining work is to find the optimal context relaxation vectors for DCR and the optimal context weighting vectors for DCW.
PSO is derived from swarm intelligence, in which a group achieves a goal through collaboration, as fish, birds and bees do.
Why PSO?
1) It is easy to implement as a non-linear optimizer;
2) It has been used in weighted CF before, and was demonstrated to work better than other non-linear optimizers, e.g. the genetic algorithm;
3) Our previous work successfully applied binary PSO (BPSO) to DCR.

Particle Swarm Optimization (PSO)
Swarm = a group of birds
Particle = each bird, i.e. each run in the algorithm
Vector = the bird's position in the space, i.e. the vectors we need
Goal = the location of the pizza, i.e. a lower prediction error
So, how does the swarm find the goal?
1. Looking for the pizza: assume a machine can tell the distance.
2. Each iteration is an attempt or move.
3. Cognitive learning from the particle itself: am I closer to the pizza than at my best location so far?
4. Social learning from the swarm: "Hey, my distance is 1 mile. It is the closest! Follow me!" Then the other birds move towards that position.
DCR: feature selection, modeled by binary vectors, optimized by binary PSO
DCW: feature weighting, modeled by real-number vectors, optimized by PSO
How does it work? Take DCR and binary PSO for example:
Assume there are 4 components and 3 contextual dimensions.
Thus there are 4 binary vectors, one for each component.
We merge the vectors into a single one of size 3*4 = 12.
This single vector is the particle's position vector in the PSO process.
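The moves described above can be sketched as a minimal real-valued PSO, the variant that fits DCW. Everything here is illustrative: the inertia and learning coefficients are common textbook defaults rather than the authors' settings, `toy_cost` stands in for the real objective (the prediction error of the recommender), and `HIDDEN_OPTIMUM` is made up. Binary PSO for DCR would need a different position update and is not shown.

```python
import random


def pso_minimize(cost, dim, n_particles=3, n_iters=100, seed=0,
                 inertia=0.7, c_cognitive=1.5, c_social=1.5):
    """Minimize cost over [0, 1]^dim with a basic global-best PSO."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # each particle's best position so far
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]   # the swarm's best so far

    for _ in range(n_iters):
        for k in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[k][j] = (inertia * vel[k][j]
                             + c_cognitive * r1 * (pbest[k][j] - pos[k][j])  # cognitive
                             + c_social * r2 * (gbest[j] - pos[k][j]))       # social
                # Keep weights inside [0, 1].
                pos[k][j] = min(1.0, max(0.0, pos[k][j] + vel[k][j]))
            c = cost(pos[k])
            if c < pbest_cost[k]:
                pbest[k], pbest_cost[k] = pos[k][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[k][:], c
    return gbest, gbest_cost


def decode(position, n_components=4, n_dims=3):
    """Split the merged position vector back into one vector per component."""
    return [position[i * n_dims:(i + 1) * n_dims] for i in range(n_components)]


# Toy objective: squared distance to a made-up optimum (NOT a real RMSE).
HIDDEN_OPTIMUM = [0.2, 0.8, 0.5] * 4


def toy_cost(position):
    return sum((p - t) ** 2 for p, t in zip(position, HIDDEN_OPTIMUM))


best, best_cost = pso_minimize(toy_cost, dim=12)
component_weights = decode(best)  # 4 weighting vectors of 3 dimensions each
```

In a real run the cost function would train and evaluate the recommender with the decoded vectors, which is why each PSO iteration is expensive and the number of particles matters.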

Experimental Results Data Sets Predictive Performance Performance of Optimizer

Context-aware Data Sets
                AIST Food Data                                   Movie Data
# of Ratings    6360                                             1010
# of Users      212                                              69
# of Items      20                                               176
Contexts        Real hunger (full/normal/hungry),                Time (weekend, weekday), Location (home, cinema),
                Virtual hunger                                   Companions (friends, alone, etc.)
Other Features  User gender, Food genre, Food style, Food stuff  User gender, Year of the movie
Density         Dense                                            Sparse
Context-aware data sets are usually difficult to get. These two data sets were collected from surveys.

Evaluation Protocols
Metrics: root-mean-square error (RMSE) and coverage, which denotes the percentage of predictions for which we can find neighbors.
Our goal: improve RMSE (i.e. fewer errors) within a decent coverage. We allow a decline in coverage, because applying contextual constraints usually brings lower coverage (i.e. the sparsity of contexts!).
Baselines:
context-free CF, i.e. the original UBCF;
contextual pre-filtering CF, which applies the contextual constraints only to the neighbor selection component, and not to the other components used in DCR and DCW.
Other settings in DCR & DCW:
K = 10 for UserKNN, evaluated with 5-fold cross-validation;
T = 100 as the maximal iteration limit in the PSO process;
Weights range within [0, 1];
We use the same similarity threshold for each component, which was iterated from 0.0 to 1.0 in increments of 0.1 in DCW.
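The two metrics can be computed together. In this sketch a prediction of None means that no neighbors were found, RMSE is taken over the covered predictions only, and coverage is the covered fraction; that reading of the protocol is an assumption, and the numbers are made up.

```python
from math import sqrt


def rmse_and_coverage(predictions, actuals):
    """Return (RMSE over covered predictions, fraction of predictions covered)."""
    covered = [(p, a) for p, a in zip(predictions, actuals) if p is not None]
    coverage = len(covered) / len(actuals) if actuals else 0.0
    if not covered:
        return None, coverage
    rmse = sqrt(sum((p - a) ** 2 for p, a in covered) / len(covered))
    return rmse, coverage


# Hypothetical test fold: the second prediction could not be made.
rmse, coverage = rmse_and_coverage([4.0, None, 3.0, 5.0], [5, 3, 3, 4])
```

Because RMSE ignores uncovered cases, a method can look more accurate simply by predicting less often, which is why the two numbers are reported side by side.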

Predictive Performances
Blue bars are RMSE values; red lines are coverage curves.
Findings:
1) DCW works better than DCR and the two baselines;
2) A t-test shows that the improvement by DCW is significant on the movie data, whereas the improvement by DCR over the two baselines was not significant; DCW can further alleviate the sparsity of contexts and compensate for DCR;
3) DCW offers better coverage than the baselines!

Performances of Optimizer
Running time is in seconds. Using 3 particles is the best configuration for the two data sets here!
Factors influencing the running performance:
# of particles: more particles, quicker convergence but probably higher cost;
# of contextual variables: more contexts, probably slower;
Density of the data set: denser, more calculations in DCW.
Typically DCW costs more than DCR, because it uses all contextual dimensions and calculating the similarity of contexts is time-consuming, especially for dense data like the Food data.

Other Results (Optional)
1. The optimal threshold for the similarity of contexts: for the Food data set, it is 0.6; for the Movie data set, it is 0.1.
2. The optimal weighting vectors (e.g. Movie data). Note: darker means smaller weights; lighter means larger weights.

It is gonna end Conclusions Future Work

Conclusions
We propose DCW, which is a finer-grained improvement over DCR;
It can further improve predictive accuracy within decent coverage;
PSO is demonstrated to be an efficient optimizer;
We identified underlying factors that influence the running time of the optimizer.
Stay Tuned
DCR and DCW are general frameworks (together named DCM, i.e. differential context modeling), and they can be applied to any recommendation algorithm that can be decomposed into multiple components.
We have successfully extended their application to item-based collaborative filtering and the Slope One recommender.
References
Y. Zheng, R. Burke, B. Mobasher. "Differential Context Modeling in Collaborative Filtering". In SOCRS-2013, Chicago, IL USA, 2013

Future Work
Try other measures of context similarity instead of the simple Jaccard one;
Introduce semantics into the similarity of contexts to further alleviate the sparsity of contexts, e.g. Rome is closer to Florence than to Paris;
Parallelize PSO or run PSO on MapReduce to speed up the optimizer.
Acknowledgement
Student Travel Support from US NSF (UMAP Platinum Sponsor)
See you later
The 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL USA, Aug 11-14, 2013

Thank You! Center for Web Intelligence, DePaul University, Chicago, IL USA