A Brief Review of Representation Learning in Recommender Systems
Wayne Xin Zhao (赵鑫), RUC
batmanfly@qq.com
Representation learning
Overview of recommender systems
Tasks: rating prediction, item recommendation
Basic models: MF, LibFM
Rating Prediction
User-item rating matrix ("?" = unknown):
       i1   i2
  u1   1    ?
  u2   ?    5
  u3   3    ?
  u4   ?    2
Evaluation: online test vs. offline test
Item Recommendation
User-item interaction matrix ("?" = unknown):
       i1   i2
  u1   yes  ?
  u2   ?    yes
  u3   no   ?
  u4   yes  yes
Evaluation: online test vs. offline test; retrieval-based metrics, e.g., P@k, R@k, MAP
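Since the slide only names the offline metrics, here is a minimal Python sketch of how P@k, R@k, and AP (whose mean over test users is MAP) are typically computed; the function names are illustrative.

```python
# Minimal sketches of the retrieval-based metrics named above.
def precision_at_k(ranked_items, relevant, k):
    """P@k: fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked_items, relevant, k):
    """R@k: fraction of all relevant items retrieved within the top k."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def average_precision(ranked_items, relevant):
    """AP: average of P@k over the ranks k where a relevant item appears;
    MAP is the mean of AP over all test users."""
    score, hits = 0.0, 0
    for k, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

# Example: 4 recommended items, 2 of them relevant.
print(precision_at_k(["i3", "i7", "i1", "i9"], {"i7", "i9"}, k=2))  # 0.5
```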
Context-Aware Recommendation: when you have more information about users and items
More complicated tasks
Practical Considerations
Rating Prediction
User-item rating matrix ("?" = unknown):
       i1   i2
  u1   1    ?
  u2   ?    5
  u3   3    ?
  u4   ?    2
Evaluation: online test vs. offline test
Latent Factor Models
Matrix factorization
A Basic Model
A Basic Model: another formulation
Probabilistic Matrix Factorization
Probabilistic Matrix Factorization
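For reference, the basic latent factor model and its probabilistic (PMF) reading, following Mnih and Salakhutdinov, can be written as:

```latex
% Basic MF: a rating is approximated by an inner product of latent factors
\hat{r}_{ui} = \mathbf{p}_u^\top \mathbf{q}_i
% Learning: regularized squared error over the observed ratings \mathcal{K}
\min_{P, Q} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - \mathbf{p}_u^\top \mathbf{q}_i \right)^2
    + \lambda \left( \lVert \mathbf{p}_u \rVert^2 + \lVert \mathbf{q}_i \rVert^2 \right)
% PMF: Gaussian likelihood with Gaussian priors on the factors;
% MAP estimation recovers the regularized objective above
r_{ui} \sim \mathcal{N}(\mathbf{p}_u^\top \mathbf{q}_i, \sigma^2), \quad
\mathbf{p}_u \sim \mathcal{N}(\mathbf{0}, \sigma_P^2 \mathbf{I}), \quad
\mathbf{q}_i \sim \mathcal{N}(\mathbf{0}, \sigma_Q^2 \mathbf{I})
```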
Context-Aware Recommendation: when you have more information about users and items
LibFM
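For reference, LibFM implements Rendle's factorization machines; the degree-2 FM model equation is:

```latex
% Degree-2 factorization machine: linear terms plus pairwise interactions,
% with each interaction weight factorized as an inner product of k-dim
% vectors v_i, so sparse context features can be handled in O(kn)
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
    + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
```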
Outline of the approaches
- Recommendation by network embedding
- Recommendation by word embedding
- Embedding as regularization
- Recommendation by TransE
- Recommendation by metric learning
- Recommendation by multi-modality fusion
What is network embedding?
- Map each node in a network into a low-dimensional space
- Distributed representation for nodes
- Similarity between nodes indicates the link strength
- Encode network information and generate node representations
Example: Zachary's Karate Network
Framework
LINE: First-order Proximity
[Figure: example graph with vertices 1-10; vertices 6 and 7 have a large first-order proximity]
The local pairwise proximity between the vertices, determined by the observed links.
However, many links between vertices are missing, so first-order proximity alone is not sufficient for preserving the entire network structure.
(From Jian Tang's slides)
LINE: Second-order Proximity
[Figure: example graph; vertices 5 and 6 have a large second-order proximity]
  $\vec{p}_5 = (1, 1, 1, 1, 0, 0, 0, 0, 0, 0)$
  $\vec{p}_6 = (1, 1, 1, 1, 0, 0, 5, 0, 0, 0)$
The proximity between the neighborhood structures of the vertices.
Mathematically, the second-order proximity between a pair of vertices $(u, v)$ is determined by their neighborhood weight vectors:
  $\vec{p}_u = (w_{u1}, w_{u2}, \ldots, w_{u|V|})$ and $\vec{p}_v = (w_{v1}, w_{v2}, \ldots, w_{v|V|})$
(From Jian Tang's slides)
LINE: Preserving the First-order Proximity
Given an undirected edge $(v_i, v_j)$, the joint probability of $(v_i, v_j)$:
  $p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{u}_i^\top \vec{u}_j)}$, where $\vec{u}_i$ is the embedding of vertex $v_i$
Empirical distribution: $\hat{p}_1(v_i, v_j) = \frac{w_{ij}}{\sum_{(i',j')} w_{i'j'}}$
Objective (KL-divergence between the two distributions):
  $O_1 = d(\hat{p}_1(\cdot,\cdot), p_1(\cdot,\cdot)) \propto -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)$
(From Jian Tang's slides)
LINE: Preserving the Second-order Proximity
Given a directed edge $(v_i, v_j)$, the conditional probability of $v_j$ given $v_i$:
  $p_2(v_j \mid v_i) = \frac{\exp(\vec{u}_j'^\top \vec{u}_i)}{\sum_{k=1}^{|V|} \exp(\vec{u}_k'^\top \vec{u}_i)}$
  $\vec{u}_i$: embedding of vertex $i$ as a source node; $\vec{u}_i'$: embedding of vertex $i$ as a target node
Empirical distribution: $\hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{\sum_{k \in V} w_{ik}}$
Objective: $O_2 = \sum_{i \in V} \lambda_i \, d(\hat{p}_2(\cdot \mid v_i), p_2(\cdot \mid v_i)) \propto -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i)$
  $\lambda_i$: prestige of vertex $i$ in the network, $\lambda_i = \sum_j w_{ij}$
(From Jian Tang's slides)
LINE: Preserving Both Proximities
Concatenate the embeddings learned individually for the two proximities (first-order and second-order).
(From Jian Tang's slides)
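Below is a minimal numpy sketch of optimizing LINE's first-order objective with negative sampling (the practical approximation used in the LINE paper, rather than the full KL form above); the toy graph and all names are illustrative.

```python
# Minimal sketch of LINE, first-order proximity with negative sampling.
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim, lr, n_neg = 10, 8, 0.05, 3
# Toy weighted edge list (i, j, w_ij), 0-indexed: the (5, 6) edge mirrors
# the strongly tied vertices "6" and "7" from the example slide.
edges = [(5, 6, 5.0),
         (0, 4, 1.0), (1, 4, 1.0), (2, 4, 1.0), (3, 4, 1.0),
         (0, 5, 1.0), (1, 5, 1.0), (2, 5, 1.0), (3, 5, 1.0)]
U = rng.normal(scale=0.1, size=(num_nodes, dim))  # one embedding per vertex

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    for i, j, w in edges:
        ui, uj = U[i].copy(), U[j].copy()
        # Positive edge: raise p1(vi, vj) = sigma(ui . uj), weighted by w_ij.
        g = w * (1.0 - sigmoid(ui @ uj))
        U[i] += lr * g * uj
        U[j] += lr * g * ui
        # Negative samples: lower sigma(ui . uk) for random vertices k.
        for k in rng.integers(0, num_nodes, size=n_neg):
            g = w * sigmoid(ui @ U[k])
            U[i] -= lr * g * U[k]
            U[k] -= lr * g * ui

print(sigmoid(U[5] @ U[6]))  # strongly tied vertices end up close
```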
Recommendation by network embedding: Learning Distributed Representations for Recommender Systems with a Network Embedding Approach (Zhao et al., AIRS 2016). Motivation
Recommendation by network embedding Given any edge in the network
Recommendation by network embedding User-item recommendation
Recommendation by network embedding User-item-tag recommendation
Outline of the approaches
- Recommendation by network embedding
- Recommendation by word embedding
- Embedding as regularization
- Recommendation by TransE
- Recommendation by metric learning
- Recommendation by multi-modality fusion
Recommendation by word embedding
Recall word2vec:
- Input: a sequence of words from a vocabulary V
- Output: a fixed-length vector v_w for each term w in the vocabulary
- It implements the idea of distributional semantics using a shallow neural network model.
Recommendation by word embedding
Generalized token2vec:
- Input: a sequence of symbol tokens from a vocabulary V
- Output: a fixed-length vector v_w for each symbol w in the vocabulary
- Any sequence whose tokens are sensitive to their surrounding contexts can potentially be modeled with word2vec.
Recommendation by word embedding
POI data modeling:
- Check-in information: user ID, location ID, check-in time, category label/name, GPS information
- User connections
A sequential way to model POI data
Given a user u, a trajectory is the sequence of check-in records related to u:
  User ID   Location ID   Check-in Timestamp
  u1        l181          2016-08-26 9:26am
  u1        l32           2016-08-26 10:26am
  u1        l323          2016-08-25 11:26am
  u1        l32323        2016-08-25 1:26pm
  u2        l345          2016-08-26 9:16am
  u2        l13           2016-08-26 10:36am
A sequential way to model POI data
Given a user u, a trajectory is the sequence of check-in records related to u:
  User ID   Location ID   Check-in Timestamp
  u1        l181          2016-08-26 9:26am
  u1        l32           2016-08-26 10:26am
  u1        l323          2016-08-25 11:26am
  u1        l32323        2016-08-25 1:26pm
  u2        l345          2016-08-26 9:16am
  u2        l13           2016-08-26 10:36am
Resulting location sequences:
  u1: l181 → l32 → l323 → l32323
  u2: l345 → l13
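As a concrete illustration of the token2vec idea, each trajectory can be fed to an off-the-shelf word2vec implementation as a "sentence" of location tokens; this gensim sketch is illustrative only and is not the customized model of the papers discussed here.

```python
# Treat each user's trajectory as a sentence of location tokens and train
# skip-gram word2vec on the corpus of trajectories.
from gensim.models import Word2Vec

trajectories = [
    ["l181", "l32", "l323", "l32323"],  # user u1
    ["l345", "l13"],                    # user u2
]
model = Word2Vec(sentences=trajectories, vector_size=32, window=2,
                 min_count=1, sg=1, negative=5, epochs=50)
print(model.wv["l32"])                       # embedding of location l32
print(model.wv.most_similar("l32", topn=2))  # nearby locations in the space
```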
Task
- Input: check-in sequences together with user relations
- Output: embedding representations for users, locations, and other related information
(Zhao et al., ACM TKDD 2017)
Recall CBOW
CBOW predicts the current word from its surrounding contexts: $\Pr(w_t \mid \mathrm{context}(w_t))$, with window size $2c$ and $\mathrm{context}(w_t) = [w_{t-c}, \ldots, w_{t+c}]$.
Modeling sequential relatedness: a direct application of doc2vec
Modeling social connectedness: a skip-gram way to model all of a user's friends
A joint model to characterize trajectories and links: jointly optimizing the two loss functions
Modeling multi-grained sequential contexts
A long trajectory sequence can be split into multiple segments:
  User ID   Location ID   Check-in Timestamp
  u1        l181          2016-08-26 9:26am
  u1        l32           2016-08-26 10:26am
  u1        l323          2016-08-25 11:26am
  u1        l32323        2016-08-25 1:26pm
Segments:
  u1: s1 → s2
  s1: l181 → l32
  s2: l323 → l32323
Modeling multi-grained sequential contexts
Modeling segment-level relatedness:
  u1: s1 → s2
  s1: l181 → l32
  s2: l323 → l32323
Modeling multi-grained sequential contexts
Modeling location-level relatedness:
  u1: s1 → s2
  s1: l181 → l32
  s2: l323 → l32323
The joint hierarchical model: jointly optimizing the three objective functions
Recommendation by word embedding
Token2vec for product recommendation via doc2vec (Zhao et al., IEEE TKDE 2016):
  doc → user, word → product
A user-profiling approach.
Recommendation by word embedding Token2vec for next-basket recommendation (Wang et al., SIGIR 2015)
Outline of the approaches
- Recommendation by network embedding
- Recommendation by word embedding
- Embedding as regularization
- Recommendation by TransE
- Recommendation by metric learning
- Recommendation by multi-modality fusion
Matrix factorization: motivation
- MF mainly captures user-item interactions
- Item co-occurrence across users is ignored
(Liang et al., RecSys 2016)
Item embedding: motivation
- Levy and Goldberg showed an equivalence between skip-gram word2vec trained with k negative samples and implicitly factorizing the pointwise mutual information (PMI) matrix shifted by log k.
- Analogously, we can factorize the item co-occurrence matrix to obtain item embeddings.
The joint model: MF with embedding regularization
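In CoFactor (Liang et al., 2016), MF's squared loss is combined with factorizing this shifted PMI matrix using the same item factors. Below is a minimal numpy sketch of building the shifted positive PMI (SPPMI) matrix from item co-occurrence counts (illustrative, not the authors' code; k is the negative-sampling value from the Levy-Goldberg equivalence).

```python
import numpy as np

def sppmi(cooc, k=1):
    """Shifted positive PMI from an item-item co-occurrence count matrix."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)   # marginal counts of item i
    col = cooc.sum(axis=0, keepdims=True)   # marginal counts of item j
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi = np.where(cooc > 0, pmi, -np.inf)   # undefined PMI clips to zero
    return np.maximum(pmi - np.log(k), 0.0)  # shift by log k, keep positives

# Toy example: items 0 and 1 co-occur often across users' histories.
cooc = np.array([[0., 8., 1.],
                 [8., 0., 1.],
                 [1., 1., 0.]])
print(sppmi(cooc, k=2))
```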
TransE: characterizing triple relations. TransE models a knowledge triple (h, r, t) by requiring h + r ≈ t in the embedding space.
Next recommendation scenario: what's the next movie to watch? (He et al., RecSys 2017)
Next recommendation scenario: a traditional method uses Markov chains and factorized Markov chains.
Next recommendation scenario: a TransE-based approach.
Next recommendation scenario: a TransE-based approach (cont.)
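A minimal sketch of the translation-based scoring idea in He et al. (2017): each user is a translation vector, and a plausible next item lies close to (previous item + user vector). This omits the paper's bias terms and training details; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 3, 5, 8
T = rng.normal(scale=0.1, size=(n_users, dim))  # per-user translation vectors
E = rng.normal(scale=0.1, size=(n_items, dim))  # item embeddings

def next_item_scores(u, prev_item):
    """Score candidates by -||e_prev + t_u - e_j||: closer means better."""
    target = E[prev_item] + T[u]
    return -np.linalg.norm(E - target, axis=1)

scores = next_item_scores(u=0, prev_item=2)
print(np.argsort(-scores)[:3])  # top-3 next-item recommendations
```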
Outline of the approaches
- Recommendation by network embedding
- Recommendation by word embedding
- Embedding as regularization
- Recommendation by TransE
- Recommendation by metric learning
- Recommendation by multi-modality fusion
Metric learning for recommendation
Metric: a metric on a set X is a function d: X × X → [0, ∞) satisfying, for all x, y, z ∈ X:
- d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
- d(x, y) = d(y, x) (symmetry)
- d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Metric learning for recommendation
Metric learning: the original approach learns a Mahalanobis distance $d_A(x, y) = \sqrt{(x - y)^\top A (x - y)}$ with $A \succeq 0$, and the objective function pulls similar pairs together while keeping dissimilar pairs apart.
Metric learning for recommendation
Metric learning for kNN: large margin nearest neighbor (LMNN), trained with a pull loss and a push loss (standard forms below).
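For reference, the standard LMNN losses (Weinberger and Saul) take the following form, where j ⇝ i denotes a target neighbor of x_i and y_il indicates whether x_i and x_l share a label:

```latex
% Pull loss: draw each point toward its target neighbors j ~> i
\mathcal{L}_{\mathrm{pull}} = \sum_{j \rightsquigarrow i} d(\mathbf{x}_i, \mathbf{x}_j)^2
% Push loss: hinge penalty on differently labeled impostors l
\mathcal{L}_{\mathrm{push}} = \sum_{j \rightsquigarrow i} \sum_{l} (1 - y_{il})
    \left[ 1 + d(\mathbf{x}_i, \mathbf{x}_j)^2 - d(\mathbf{x}_i, \mathbf{x}_l)^2 \right]_+
```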
Metric learning for recommendation
Representation-based metric learning: learn user and item points in a joint metric space, with a distance function between users and items and a hinge-style loss function over observed pairs (Hsieh et al., WWW 2017).
Metric learning for recommendation
Representation-based metric learning: improving the representations by integrating item features as a regularizer; the joint loss combines the metric loss with this regularization (a minimal training sketch follows).
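Below is a minimal numpy sketch of the CML hinge objective (Hsieh et al., 2017); it omits the paper's rank-based weighting, covariance regularization, and item-feature loss, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim, margin, lr = 4, 6, 8, 0.5, 0.05
U = rng.normal(scale=0.1, size=(n_users, dim))        # user points
V = rng.normal(scale=0.1, size=(n_items, dim))        # item points
positives = [(0, 1), (0, 2), (1, 3), (2, 0), (3, 5)]  # observed (user, item)

def clip_to_unit_ball(X):
    """CML bounds the space by keeping all points inside the unit sphere."""
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
    return X / norms

for _ in range(500):
    u, i = positives[rng.integers(len(positives))]
    j = int(rng.integers(n_items))                    # random negative item
    if (u, j) in positives:
        continue
    d_pos, d_neg = U[u] - V[i], U[u] - V[j]
    # Hinge loss: [margin + ||u - i||^2 - ||u - j||^2]_+
    if margin + d_pos @ d_pos - d_neg @ d_neg > 0:
        U[u] -= lr * 2 * (d_pos - d_neg)
        V[i] += lr * 2 * d_pos
        V[j] -= lr * 2 * d_neg
    U, V = clip_to_unit_ball(U), clip_to_unit_ball(V)

print(np.linalg.norm(U[0] - V[1]))  # observed pair pulled close
```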
Outline of the approaches
- Recommendation by network embedding
- Recommendation by word embedding
- Embedding as regularization
- Recommendation by TransE
- Recommendation by metric learning
- Recommendation by multi-modality fusion
Multi-modality representation Rich side information
Multi-modality representation Rich side information Zhang et al., KDD 2016
Multi-modality representation Rich side information Modeling KB information
Multi-modality representation Rich side information Modeling text information
Multi-modality representation Rich side information Modeling image information
Multi-modality representation Rich side information Generative process
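In CKE (Zhang et al., KDD 2016), the item latent vector sums a collaborative offset with the three modality encodings. A schematic sketch, with random placeholder vectors standing in for the learned TransR/SDAE/SCAE encoders:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16
offset = rng.normal(scale=0.1, size=dim)      # item-specific latent offset
structural = rng.normal(scale=0.1, size=dim)  # TransR entity embedding (KB)
textual = rng.normal(scale=0.1, size=dim)     # SDAE encoding of the text
visual = rng.normal(scale=0.1, size=dim)      # SCAE encoding of the image

item_vec = offset + structural + textual + visual  # fused item representation
user_vec = rng.normal(scale=0.1, size=dim)
print(user_vec @ item_vec)  # preference score, as in plain MF
```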
Multi-modality representation Complementary effect of visual and textual features Chen et al., to appear in AIRS 2017
Multi-modality representation A Multi-task learning method Chen et al., to appear in AIRS 2017
Future work
ItemKNN → MF (SVD++) → BPR → FM → ?
- Why do SVD++, BPR, and FM perform so consistently well across various datasets?
- How can recommender systems borrow ideas from representation learning and deep learning?
- What are the future directions for recommender systems?
Thanks

References:
- Wayne Xin Zhao, Sui Li, Yulan He, Edward Y. Chang, Ji-Rong Wen, Xiaoming Li. Connecting Social Media to E-Commerce: Cold-Start Product Recommendation Using Microblogging Information. IEEE Trans. Knowl. Data Eng. 28(5): 1147-1159 (2016)
- Pengfei Wang, Jiafeng Guo, Yanyan Lan, Jun Xu, Shengxian Wan, Xueqi Cheng. Learning Hierarchical Representation Model for Next Basket Recommendation. SIGIR 2015: 403-412
- Xu Chen, Yongfeng Zhang, Wayne Xin Zhao, Zheng Qin. A Collaborative Neural Model for Rating Prediction by Leveraging User Reviews and Product Images. To appear in AIRS 2017
- Wayne Xin Zhao, Feifan Fan, Ji-Rong Wen, Edward Y. Chang. Joint Representation Learning for Location-based Social Networks with Multi-Grained Sequential Contexts. To appear in ACM TKDD
- Dawen Liang, Jaan Altosaar, Laurent Charlin, David M. Blei. Factorization Meets the Item Embedding: Regularizing Matrix Factorization with Item Co-occurrence. RecSys 2016
- Ruining He, Wang-Cheng Kang, Julian McAuley. Translation-based Recommendation. RecSys 2017: 161-169
- Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge J. Belongie, Deborah Estrin. Collaborative Metric Learning. WWW 2017: 193-201
- Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, Wei-Ying Ma. Collaborative Knowledge Base Embedding for Recommender Systems. KDD 2016: 353-362