Signed Distance-based Deep Memory Recommender


ABSTRACT

Personalized recommendation algorithms learn a user's preference for an item by measuring a distance/similarity between them. However, some existing recommendation algorithms (e.g., matrix factorization) assume a linear relationship between the user and the item. This approach may limit the capacity of recommender systems, since the interactions between users and items in real-world applications are much more complex than a linear relationship. To overcome this limitation, in this paper we design and propose a deep learning framework called the Signed Distance-based Deep Memory Recommender, which captures non-linear relationships between users and items directly and indirectly, and works well in both the general recommendation task and the shopping basket-based recommendation task. Through an extensive empirical study on six real-world datasets in the two recommendation tasks, our proposed approach achieved significant improvement over ten state-of-the-art recommendation models. Our source code is available at an anonymized URL.

ACM Reference Format: Signed Distance-based Deep Memory Recommender. In Proceedings of ACM WWW conference (WWW '19). ACM, New York, NY, USA, 12 pages.

1 INTRODUCTION

Recommender systems [1] have been deployed in many online applications such as e-commerce, music/video streaming services, social media, etc. They have played a vital role for users to explore new items and for companies to increase their revenues. Most recommendation algorithms model user preferences and item properties based on observed interactions (e.g., clicks, reviews, ratings) between users and items [20, 21, 30]. From this perspective, we can view most recommendation models as a measurement of similarity or distance between a user and an item. For instance, the well-known latent factor (i.e., matrix factorization) models [19] usually employ an inner product function to approximate the similarity between the user and the item. Although the latent factor models achieved competitive performance on some datasets, they did not correctly capture complex (i.e., non-linear) relationships between users and items, because the inner product function follows a limited linear nature.

Existing recommendation algorithms face difficulty in finding good kernels for different data patterns [30], only focus on the user-item latent space without considering the item-item latent space together [12, 24, 25, 42, 56], or require additional auxiliary information (e.g., item descriptions, music content, reviews) [4, 17, 29, 31, 52]. Overcoming these drawbacks, in this paper we aim to propose and build a deep learning framework to learn non-linear relationships between a user and a target item by measuring a distance from the observed data.

Figure 1: We consider a recommender as a signed distance approximator, and decompose the signed distance between a user and an item into two parts: the left box learns a signed distance between the user and the item (i.e., the camera lens); the right box learns a signed distance between the item and the user's recently consumed items (i.e., the book, CD and camera). Our novel personalized metric-based soft attention is applied to the consumed items to optimize their contributions to the final output signed distance score. Then the two parts are combined to obtain a final score. Most linear latent factor models are equivalent to simply measuring the linear Euclidean distance in the user-item latent space (shown as the green line).

In particular, we propose the Signed Distance-based Deep Memory Recommender (SDMR), which captures the non-linear relationship of the user and item directly and indirectly, combines the directly and indirectly measured relationships to produce a final distance score for the recommendation, and works well in both the general recommendation task and the shopping basket-based recommendation task. SDMR internally combines two signed distances, each of which is measured by our proposed Signed Distance-based Perceptron (SDP) and Signed Distance-based Memory Network (SDM). On one hand, SDP directly measures a non-linear signed distance between the user and the item. Many existing models [13, 16] rely on a pre-defined metric such as Euclidean distance (the green line in Figure 1), which is much more limited than a customized non-linear signed distance learned from the data (the red curves in Figure 1). On the other hand, SDM indirectly measures a non-linear signed distance between the user and the item via the user's recently consumed items. SDM is similar to the item neighborhood-based recommender [35, 41] in nature. However, it is more advanced in several aspects, as shown in the right side of Figure 1. First, SDM only focuses on a set of recently consumed items of the target user (e.g., the book, CD and camera in Figure 1) as context items. Second, it employs additional memories to learn a novel personalized metric-based attention on the consumed items. The goal of our proposed attention is to compute the weight of each consumed item w.r.t. the target item (i.e., the camera lens). In the example, the attention module assigns higher weights to the camera and lower weights to the book and CD.

Unlike our approach, most existing neighborhood-based models consider the contribution of each consumed item to the target item equally, which may lead to unsatisfactory results. Last but not least, we update the attention weights via a gated multi-hop design to build a long-term memory within SDM. This multi-hop design helps refine our attention module and produces more accurate attentive scores.

The contributions of this work are summarized as follows:
- We design a deep learning framework which can tackle both the general recommendation task and the shopping basket-based recommendation task.
- We propose SDMR, which internally combines two signed distance scores measured by SDP and SDM, capturing the non-linear relationship between a user and an item directly and indirectly.
- To better balance the weights among the consumed items of the user, we propose a novel multi-hop memory network with a personalized metric-based attention mechanism in SDM.
- Extensive experiments on six datasets in two different recommendation tasks demonstrate the effectiveness of our proposed methods against ten baselines.

2 RELATED WORK

Latent Factor Models (LFM) have been extensively studied in the literature, including Matrix Factorization [16], Bayesian Personalized Ranking [38], fast matrix factorization for implicit feedback (eALS) [13], etc. Despite their success, LFM suffer from several limitations. First, LFM overlook associations between the user's previously consumed items and the target item (e.g., mobile phones and phone cases). Second, LFM usually rely on the inner product function, whose linearity limits the capability of modeling complex user-item interactions. To address the second issue, several non-linear latent factor models have been proposed, with the help of Gaussian processes [23] or kernels [30, 61]. However, they either require expensive hyper-parameter tuning or face difficulty in finding good kernels for different data patterns.

Neighborhood-based models [35, 41] are usually based on the principle that similar users prefer similar items. The problem turns into finding the neighbors of a user or an item based on a pre-defined distance/similarity metric, such as cosine vector similarity [3, 22], Pearson Correlation similarity [6], etc. The recommendation quality highly depends on the chosen metric, but finding a good pre-defined metric is usually very challenging. Furthermore, these models are also sensitive to the selection of neighbors. Our proposed SDM is similar to neighborhood-based models in nature, but it exploits a novel personalized metric-based attention for assigning attentive weights to neighbor items. Therefore, our approach is more robust and less sensitive than conventional neighborhood-based models.

NeuMF [12] is a neural network that generalizes matrix factorization via a Multi-Layer Perceptron (MLP) for learning non-linear interaction functions. Similarly, some other works [24, 25, 42, 56] substitute the MLP with an auto-encoder architecture. It is worth noting that all these approaches are limited by only considering the user-item latent space, overlooking the correlations in the item-item latent space. Besides, some deep learning based works [32, 44, 50] employ auxiliary information such as item descriptions [17], music content [52], item visual features [4, 29], and reviews [31] to address the cold-start problem. However, this auxiliary information is not always available, which limits their applicability in many real-world systems. Another line of work uses deep neural networks to model temporal effects of consumed items [14, 36, 48, 55]. Although our proposed methods do not explicitly consider temporal effects, SDM utilizes the time information to select a set of recently consumed items as the neighborhood of the target item.

The most closely related work to ours is the recently proposed Collaborative Memory Network (CMN) [7]. In that work, a Memory Network [47] is adapted to measure the similarities between users and user neighbors. The key differences between our work and CMN are as follows: (i) we follow an item neighborhood-based design, whereas CMN follows a user neighborhood-based design; prior work showed that item neighborhood-based models slightly outperform user neighborhood-based models [27, 41]; (ii) our proposed SDM model uses our personalized metric-based attention mechanism and produces signed distance scores as output, whereas CMN exploited a traditional inner product-based attention; (iii) we use a gated multi-hop architecture [28], whereas CMN used the original multi-hop design; a prior study showed that the gated multi-hop design outperforms the original multi-hop design [47].

3 PROBLEM STATEMENT

In this section, we describe two recommendation problems: (i) the general recommendation task, and (ii) the shopping basket-based recommendation task. In the following sections, we focus on solving these two tasks.

General recommendation task: Given a whole item set $V = \{v_1, v_2, \dots, v_{|V|}\}$ and a whole user set $U = \{u_1, u_2, \dots, u_{|U|}\}$, each user $u \in U$ may consume several items $\{v_1, v_2, \dots, v_k\}$ in $V$, denoted as a set of neighbor items $c$. In this task, given a user's previously consumed items, a recommendation model predicts the next target item $v_j$ that user $u$ may prefer; we denote this task as estimating $P(u, v_j \mid c)$. Note that some existing works assume independent relationships between $v_j$ and the neighbor items in $c$, leading to $P(u, v_j \mid c) = P(u, v_j)$ [12, 13]. In our work, we model user $u$'s preference on $v_j$ in two steps: (i) a direct preference of $u$ on $v_j$ in a signed distance-based perceptron, and (ii) an indirect preference of $u$ on $v_j$ via summing attentive effects of neighbor items toward the target item $v_j$ in a signed distance-based memory network.

Shopping basket-based recommendation task: This problem is based on the fact that users go shopping offline/online and add several items into a basket/cart together. Each shopping basket/cart is seen as a transaction, and each user may shop once or multiple times, leading to one or multiple transactions. Let $T(u) = \{t_1, t_2, \dots, t_{|T(u)|}\}$ be the set of user $u$'s transactions, where $|T(u)|$ denotes the number of user $u$'s transactions. Each transaction $t = \{v_1, v_2, \dots, v_{|t|}\}$ consists of several items from the whole item set $V$. In this problem, it is assumed that all the items in $t$ are inserted into the same basket at the same time, ignoring the actual order of the items being inserted and considering $t$'s transaction time as each item's insertion time. Given a target item $v_j \in t$, the rest of the items in $t$ are seen as the context items or neighbor items of $v_j$, denoted as $c$ (i.e., $c = t \setminus v_j$). Then, given the set of neighbor items $c$, a recommendation model predicts a conditional probability $P(u, v_j \mid c)$, which is interpreted as the conditional probability that $u$ will add the item $v_j$ into the same basket with the other items $c$.

Both recommendation tasks above are popular in the literature [8, 12, 36, 39]. The general recommendation task differs from the shopping basket-based recommendation task in that there are no specific context items of the target item in the general recommendation task. In this paper, we focus on personalized recommendation tasks because they are preferred over non-personalized recommendation tasks in the literature [8, 36, 39].

4 PROPOSED METHODS

Our proposed Signed Distance-based Deep Memory Recommender (SDMR) consists of two major components: the Signed Distance-based Perceptron (SDP) and the Signed Distance-based Memory network (SDM). We first give an overview of our proposed models. Given a target user i and a target item j as two one-hot vectors, we pass the two vectors through the user and item embedding spaces to get a user embedding $u_i$ and an item embedding $v_j$. On one hand, our proposed SDP measures a signed distance score between $u_i$ and $v_j$ via a multi-layer perceptron network. On the other hand, given target user i, target item j, and the user's s recently consumed neighbor items as input, our SDM measures a signed distance score between user i and item j via attentive distances between the s neighbor items and the target item j. Then, the SDMR model measures a total distance between user i and item j by learning a combination of SDP and SDM. The smaller the total distance, the more likely user i will consume item j. Next, we describe SDP, SDM, and SDMR in detail.

4.1 Signed Distance-based Perceptron (SDP)

We first propose the Signed Distance-based Perceptron (SDP), which directly learns a signed distance between a target user i and a target item j. An illustration of SDP is shown in Figure 2.

Figure 2: The illustration of our SDP model.

Let the embedding of a target user i be $u_i \in \mathbb{R}^d$, and the embedding of a target item j be $v_j \in \mathbb{R}^d$, where d is the number of dimensions of each embedding. First, SDP takes a concatenation of these two embeddings as the input and proceeds as follows:

$e^{(1)} = f_1\big(W^{(1)} [u_i; v_j] + b^{(1)}\big)$  (1)
$e^{(2)} = f_2\big(W^{(2)} e^{(1)} + b^{(2)}\big)$  (2)
$\cdots$  (3)
$e^{(l)} = f_l\big(W^{(l)} e^{(l-1)} + b^{(l)}\big)$  (4)
$e^{(l+1)} = \mathrm{square}\big(e^{(l)}\big)$  (5)
$o^{(SDP)} = w^{(o)} e^{(l+1)} + b^{(o)}$  (6)

where $f_l(\cdot)$ refers to a non-linear activation function at the l-th layer (e.g., sigmoid, ReLU, or tanh), and $\mathrm{square}(\cdot)$ denotes an element-wise square function (e.g., $\mathrm{square}([2, 3]) = [4, 9]$). Based on experimental results, we choose tanh as the activation function because it yields slightly better results than ReLU. From now on, we use $f(\cdot)$ to denote the tanh function. It can easily be observed that Eqs. (1)-(4) form a standard Multi-Layer Perceptron (MLP) network, which is a popular design [12, 59] for learning a complex and non-linear interaction between the user embedding $u_i$ and the item embedding $v_j$. Our new design starts at Eqs. (5)-(6). In Eq. (5), we apply the element-wise square function $\mathrm{square}(\cdot)$ to the output vector $e^{(l)}$ of the MLP and obtain a new output vector $e^{(l+1)}$. Next, in Eq. (6), we use a fully connected layer $w^{(o)}$ to combine the different dimensions of $e^{(l+1)}$ and yield a final distance value $o^{(SDP)}$.

Our idea in using $w^{(o)}$ here is that after applying the element-wise square function $\mathrm{square}(\cdot)$ in Eq. (5), all the dimensions of $e^{(l+1)}$ will be non-negative. Thus, we consider each dimension of $e^{(l+1)}$ as a distance value. The edge weights $w^{(o)}$ are then used to combine those distance dimensions to provide a more fine-grained distance. We note that SDP reduces to a squared Euclidean distance with the following setting: at Eq. (1), $W^{(1)} = [\mathbf{1}, -\mathbf{1}]$ (with $\mathbf{1}$ denoting an identity matrix), so that $W^{(1)} [u_i; v_j] = u_i - v_j$; the activation $f(\cdot)$ is an identity function; the number of MLP layers is $l = 1$; and at Eq. (6), the edge-weights layer is $w^{(o)} = \mathbf{1}$ (the all-ones vector) with bias $b^{(o)} = 0$.

Note that if $w^{(o)}$ in Eq. (6) is an all-negative layer, it will yield a negative value, which we name a signed distance score. If we see each user as a point in a multi-dimensional space, and the user's preference space is defined by a boundary $\Omega$, we can interpret this signed distance score as follows. When the item j is outside the user i's preference boundary $\Omega$, the distance $d(i, j)$ between them is positive (i.e., $d(i, j) > 0$), reflecting that user i does not prefer item j. When the distance between user i and item j is shortened and j is right on the boundary $\Omega$, the distance between them is zero, indicating that user i likes item j. As j comes inside $\Omega$, the distance between them becomes negative and reflects a higher preference of user i for item j. In short, we can see SDP as a signed distance function, which can learn a complex signed distance between a user and an item via an MLP architecture with non-linear activations and an element-wise square function $\mathrm{square}(\cdot)$. In the recommendation domain, the signed distances provide more fine-grained distance values, thus reflecting a user's preferences on items more accurately (i.e., ranking items for the user more accurately).
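To make the computation concrete, here is a minimal PyTorch sketch of the SDP forward pass of Eqs. (1)-(6). The single hidden layer, tanh activation, and all sizes are illustrative assumptions on our part; the paper's own implementation is only referenced via an anonymized URL.

```python
import torch
import torch.nn as nn

class SDP(nn.Module):
    """Sketch of the Signed Distance-based Perceptron (Eqs. 1-6)."""

    def __init__(self, num_users, num_items, d=64):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, d)
        self.item_emb = nn.Embedding(num_items, d)
        self.mlp = nn.Linear(2 * d, d)   # W^(1), b^(1) of Eq. (1); l = 1 here
        self.out = nn.Linear(d, 1)       # edge weights w^(o), b^(o) of Eq. (6)

    def forward(self, u, j):
        ui, vj = self.user_emb(u), self.item_emb(j)
        e = torch.tanh(self.mlp(torch.cat([ui, vj], dim=-1)))  # Eqs. (1)-(4)
        e = e ** 2                            # element-wise square, Eq. (5)
        return self.out(e).squeeze(-1)        # signed distance o^(SDP), Eq. (6)

# Usage: lower (possibly negative) scores mean stronger preference.
model = SDP(num_users=943, num_items=1682)
scores = model(torch.tensor([0, 1]), torch.tensor([10, 20]))
```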

4.2 Signed Distance-based Memory Network (SDM)

We propose a multi-hop memory network, the Signed Distance-based Memory network (SDM), to model the indirect preference of a user on the target item via the user's previously consumed items (i.e., neighbor items). The indirect preference is represented as a signed distance. We first describe a single-hop SDM, and then describe how to extend it into a multi-hop design. Following the traditional architecture of a memory network [28, 47, 57], our proposed single-hop SDM has four main components: a memory module, an input module, an attention module, and an output module. An overview of SDM's architecture is presented in Figure 3.

Figure 3: The illustration of single-hop SDM, which consists of a memory module, an input module, an attention module, and an output module.

We describe each SDM module in detail as follows:

Memory Module: We maintain two memories, called the input memory and the output memory. The input memory contains two embedding matrices $U^{(in)} \in \mathbb{R}^{M \times d}$ and $V^{(in)} \in \mathbb{R}^{N \times d}$, where M and N are the number of users and the number of items in the system, respectively, and d denotes the embedding size of each user and each item. Similarly, the output memory contains two embedding matrices $U^{(out)} \in \mathbb{R}^{M \times d}$ and $V^{(out)} \in \mathbb{R}^{N \times d}$. As shown in Figure 3, the input memory is used to calculate the attention weights of a user's consumed items (i.e., neighbor items), whereas the output memory is used to measure the final signed distance between the target user and the target item via the user's neighbor items. Given a target user i, a target item j, and the set of user i's consumed items as neighbor items $\mathcal{T}_i^j$, the output of this module is the embeddings of user i, item j, and all neighbor items $k \in \mathcal{T}_i^j$. Since this module has separate input and output memories, we obtain $(u_i^{(in)}, v_j^{(in)}, \langle v_1^{(in)}, v_2^{(in)}, \dots, v_k^{(in)} \rangle)$ as the output of the input memory, and $(u_i^{(out)}, v_j^{(out)}, \langle v_1^{(out)}, v_2^{(out)}, \dots, v_k^{(out)} \rangle)$ as the output of the output memory. Obviously, $u_i^{(in)}$ is the i-th row of $U^{(in)}$, and $v_j^{(in)}$ and $v_k^{(in)}$ are the corresponding j-th and k-th rows of $V^{(in)}$. A similar explanation applies to $u_i^{(out)}$, $v_j^{(out)}$, and $v_k^{(out)}$.

Input Module: The goal of the input module is to form a non-linear combination of the target user embedding and the target item embedding. Given the target user embedding $u_i^{(in)}$ and the target item embedding $v_j^{(in)}$ from the input memory, following the widely adopted design in multimodal deep learning work [46, 60], the input module simply concatenates the two embeddings and then applies a fully connected layer with a non-linear activation $f(\cdot)$ (i.e., the tanh function) to obtain a coherent hidden feature vector as follows:

$q_{ij} = f\big(W_a [u_i^{(in)}; v_j^{(in)}] + b_a\big)$  (7)

where $W_a \in \mathbb{R}^{d \times 2d}$ holds the weights of the input module. Note that $q_{ij} \in \mathbb{R}^d$ can be seen as a query embedding in a Memory Network [47]. Similarly, if the inputs of the input module are the target user embedding $u_i^{(out)}$ and the target item embedding $v_j^{(out)}$ from the output memory, we can form a non-linear combination of $u_i^{(out)}$ and $v_j^{(out)}$ (i.e., an output query), denoted as $p_{ij}$:

$p_{ij} = f\big(W_b [u_i^{(out)}; v_j^{(out)}] + b_b\big)$  (8)

Attention Module: The goal of the attention module is to assign attentive scores to the different neighbor items (or candidates), given the combined vector (or query) $q_{ij}$ of the target user i and target item j obtained in Eq. (7). First, we calculate the L2 distance between the input query $q_{ij}$ and each candidate item $v_k^{(in)}$ as follows:

$z_{ijk} = \big\| f\big(W_c [q_{ij}; v_k^{(in)}] + b_c\big) \big\|_2$  (9)

where $\|\cdot\|_2$ refers to the L2 distance (or Euclidean distance), which has been widely used in previous works to measure similarity among items [8] or between users and items [15]. To better understand our intuition in Eq. (9), we break it into smaller parts. First, similar to the intuition of Eq. (7), the component $f(W_c [q_{ij}; v_k^{(in)}] + b_c)$ defines a non-linear combination of the input query $q_{ij}$ and each neighbor item embedding $v_k^{(in)}$. Then, $\|\cdot\|_2$ measures the L2 length of the combined vector. It is worth noting that with the following setting: $W_a = [\mathbf{0}, \mathbf{1}]$ (where $\mathbf{1}$ refers to an identity matrix and $\mathbf{0}$ is an all-zeros matrix), $f(\cdot)$ an identity function, $W_c = [\mathbf{1}, -\mathbf{1}]$, and bias terms $b_a = b_c = 0$, then in Eq. (7), $q_{ij} = f(W_a [u_i^{(in)}; v_j^{(in)}] + b_a) = v_j^{(in)}$; in Eq. (9), $f(W_c [q_{ij}; v_k^{(in)}] + b_c) = v_j^{(in)} - v_k^{(in)}$, and $z_{ijk} = \|v_j^{(in)} - v_k^{(in)}\|_2$, which simply generalizes an L2 distance between the target item j and the neighbor item k. Additionally, with another setting: $W_a = [\mathbf{1}, -\mathbf{1}]$, $f(\cdot)$ an identity function, $W_c = [\mathbf{1}, \mathbf{1}]$, and bias terms $b_a = b_c = 0$, then in Eq. (7), $q_{ij} = u_i^{(in)} - v_j^{(in)}$; in Eq. (9), $f(W_c [q_{ij}; v_k^{(in)}] + b_c) = u_i^{(in)} - v_j^{(in)} + v_k^{(in)}$, and $z_{ijk} = \|v_k^{(in)} + u_i^{(in)} - v_j^{(in)}\|_2$, which generalizes an L2 distance between the target item j and the neighbor item k where the user i plays the role of a translator [9].
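The following is a minimal sketch of the input query of Eq. (7) and the metric-based attention logits of Eq. (9), again assuming tanh for $f(\cdot)$; the function names and shapes are our illustrative choices, not the paper's code.

```python
import torch
import torch.nn as nn

d = 64                      # illustrative embedding size
W_a = nn.Linear(2 * d, d)   # W_a, b_a of Eq. (7)
W_c = nn.Linear(2 * d, d)   # W_c, b_c of Eq. (9)

def input_query(u_in, v_in):
    """q_ij = f(W_a [u_i; v_j] + b_a), Eq. (7)."""
    return torch.tanh(W_a(torch.cat([u_in, v_in], dim=-1)))

def attention_logits(q, neighbors_in):
    """z_ijk = || f(W_c [q_ij; v_k] + b_c) ||_2 for each neighbor k, Eq. (9).

    q: (d,) input query; neighbors_in: (s, d) input-memory embeddings of
    the s neighbor items. A lower z means a more similar neighbor.
    """
    qs = q.unsqueeze(0).expand(neighbors_in.size(0), -1)
    h = torch.tanh(W_c(torch.cat([qs, neighbors_in], dim=-1)))
    return h.norm(p=2, dim=-1)
```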

The two examples above show that our proposed design can learn a more generalized distance between target and neighbor items. The output L2 distance in Eq. (9) shows how similar the target item j and the neighbor item k are: the lower the distance score, the more similar the two items. Next, we use the softmax function to normalize and obtain the attentive score between j and k as follows:

$a_{ijk} = \frac{\exp(-z_{ijk})}{\sum_{p \in \mathcal{T}_i^j} \exp(-z_{ijp})}$  (10)

where $\mathcal{T}_i^j$ is the set of user i's neighbor items. The minus sign in Eq. (10) is used to assign a higher attention score to a lower distance between the two items (j, k). We note that the L2 distance (or Euclidean distance) satisfies the four conditions of a metric. While the crucial triangle inequality property of a metric has been shown to provide better performance than the inner product [15, 37, 45] in recommendation domains, to the best of our knowledge, most existing attention designs [2, 5, 26, 33, 43, 53, 58] adopted the inner product for measuring attentive scores. Hence, this proposed attention design is the first attempt to bring metric properties into the attention mechanism.

Similar to [49], we limit the number of considered neighbor/context items by choosing the user's s most recently consumed items before the target item j as the neighbor items of j. Here, s can be selected via tuning on a development dataset. The soft attention vector containing the attentive contribution scores of the s neighbor items toward the target item j of user i is given as follows:

$a_{ij} = [a_{ij1}, a_{ij2}, \dots, a_{ijs}]^\top$  (11)

Output Module: Given the attentive scores $a_{ij}$ in Eq. (11) and the combined vector $p_{ij} \in \mathbb{R}^d$ of the user embedding $u_i^{(out)}$ and item embedding $v_j^{(out)}$ from the output memories $U^{(out)}$ and $V^{(out)}$, the goal of this output module is to measure a total output distance $o^{(SDM)}_{ij}$ between the output target item embedding $v_j^{(out)}$ and all the user's output neighbor item embeddings $v_k^{(out)}$ ($k \in \mathcal{T}_i^j$), using the attention weights $a_{ij}$ and the output query $p_{ij}$:

$o^{(SDM)}_{ij} = w_e\, e_{ij} + b_e$  (12)

where $e_{ij} \in \mathbb{R}^d$ is calculated as:

$e_{ij} = \sum_{k \in \mathcal{T}_i^j} a_{ijk} \cdot \mathrm{square}\big(f(W_d [p_{ij}; v_k^{(out)}] + b_d)\big)$  (13)

Here, let $r_{ijk} = f(W_d [p_{ij}; v_k^{(out)}] + b_d)$. Similar to the previously discussed intuition of Eq. (9), $r_{ijk}$ is a flexible combination of $p_{ij}$ and each output neighbor item embedding $v_k^{(out)}$, and $\mathrm{square}(\cdot)$ is an element-wise square function. Our idea in Eqs. (12)-(13) is similar to the idea in Eqs. (5)-(6) of the SDP model. First, in Eq. (13), each neighbor item k attentively contributes to the target item j via a squared Euclidean measure. Second, in Eq. (12), each non-negative dimension of $e_{ij}$ is considered a distance dimension, and we use an edge-weights layer $w_e$ to combine them flexibly. When there is only one neighbor item in $\mathcal{T}_i^j$, the attention score in Eq. (13) is $a_{ijk} = 1.0$, leading to $e_{ij} = \mathrm{square}(r_{ijk})$, which is similar to Eq. (5). In this case, SDM measures the distance between target item j and neighbor item k in the same way as the SDP model does. Note that Eq. (12) is similar to Eq. (6), so SDM can also learn a signed distance value, which again provides a more fine-grained distance compared to a general distance value.
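Continuing the sketch above, the single-hop attention and output modules of Eqs. (10), (12) and (13) could be written as follows; the helper names and the module-level layer objects are again illustrative assumptions.

```python
import torch
import torch.nn as nn

d = 64
W_b = nn.Linear(2 * d, d)   # output query, Eq. (8)
W_d = nn.Linear(2 * d, d)   # combination inside Eq. (13)
w_e = nn.Linear(d, 1)       # edge-weights layer of Eq. (12)

def sdm_single_hop(z, u_out, v_out, neighbors_out):
    """Signed distance o^(SDM) given the logits z from Eq. (9).

    z: (s,) attention logits; u_out, v_out: (d,) output-memory embeddings
    of the target user/item; neighbors_out: (s, d) output-memory neighbors.
    """
    a = torch.softmax(-z, dim=-1)                          # Eq. (10): low distance -> high weight
    p = torch.tanh(W_b(torch.cat([u_out, v_out], -1)))     # output query p_ij, Eq. (8)
    ps = p.unsqueeze(0).expand_as(neighbors_out)
    r = torch.tanh(W_d(torch.cat([ps, neighbors_out], -1)))  # r_ijk
    e = (a.unsqueeze(-1) * r ** 2).sum(dim=0)              # Eq. (13)
    return w_e(e).squeeze(-1)                              # Eq. (12)
```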
Multi-hop SDM: Inspired by previous work [47], where the multi-hop design helped refine the attention module in a Memory Network, we integrate multiple hops to further extend our SDM model into a deeper network (Figure 4).

Figure 4: The illustration of our multi-hop SDM.

As the gated multi-hop design [28] was shown to perform better than the original multi-hop design with a simple residual connection in [47], we employ this gated memory update from hop to hop as follows:

$g^{(h-1)} = \sigma\big(W_g^{(h-1)} q^{(h-1)} + b_g^{(h-1)}\big)$  (14)
$q^{(h)} = (1 - g^{(h-1)}) \odot e^{(h-1)} + g^{(h-1)} \odot q^{(h-1)}$  (15)

where $q^{(h-1)}$ is the input query embedding of Eq. (7) at hop h-1, $W_g^{(h-1)}$ and bias $b_g^{(h-1)}$ are hop-specific parameters, $\sigma$ is the sigmoid function, $e^{(h-1)}$ is the output of Eq. (13) at hop h-1, and $q^{(h)}$ is the input query embedding at the next hop h. The attention can then be updated at hop h accordingly using $q^{(h)}$:

$a^{(h)}_{ijk} = \frac{\exp(-z^{(h)}_{ijk})}{\sum_{p \in \mathcal{T}_i^j} \exp(-z^{(h)}_{ijp})}$  (16)

where $z^{(h)}_{ijk}$ is measured by:

$z^{(h)}_{ijk} = \big\| f\big(W_c^{(h)} [q^{(h)}_{ij}; v_k^{(in)}] + b_c^{(h)}\big) \big\|_2$  (17)

The multi-hop architecture with the gated design further refines the attention for different users based on the previous output from hop to hop. Hence, if the final hop is h, the SDM model with h hops, denoted as SDM-h, uses $a^{(h)}_{ij}$ to yield a final signed distance score as follows:

$o^{(SDM\text{-}h)}_{ij} = w_e\, e^{(h)}_{ij} + b_e^{(h)}$  (18)

where $e^{(h)}_{ij}$ is calculated as:

$e^{(h)}_{ij} = \sum_{k \in \mathcal{T}_i^j} a^{(h)}_{ijk} \cdot \mathrm{square}\big(f(W_d^{(h)} [p^{(h)}_{ij}; v_k^{(out)}] + b_d^{(h)})\big)$  (19)
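A minimal sketch of the gated hop update of Eqs. (14)-(15) follows. The `attend` callable is a hypothetical stand-in for re-running the attention and output modules (Eqs. (16), (17) and (13)) with the current query; per the weight constraints described next, one gate layer is shared across hops.

```python
import torch
import torch.nn as nn

d, n_hops = 64, 3
W_g = nn.Linear(d, d)   # gate weights W_g, shared across hops (see below)

def multi_hop_query(q0, attend):
    """Refine the query over n_hops gated hops, Eqs. (14)-(15).

    q0: (d,) initial query from Eq. (7); attend(q) must return the hop
    output e (Eq. (13) recomputed with query q) -- assumed defined elsewhere.
    """
    q = q0
    for _ in range(n_hops):
        e = attend(q)                  # e^(h-1) via Eqs. (16), (17), (13)
        g = torch.sigmoid(W_g(q))      # gate g^(h-1), Eq. (14)
        q = (1 - g) * e + g * q        # next query q^(h), Eq. (15)
    return q
```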

Weight constraints in the multi-hop SDM model: To save memory, we use global weight constraints in the multi-hop SDM. In particular, the input memories $U^{(in)}, V^{(in)}$ and output memories $U^{(out)}, V^{(out)}$ are shared among the different hops. All the weights are shared from hop to hop: $W_a^{(1)} = W_a^{(2)} = \dots = W_a^{(h)}$; $W_b^{(1)} = W_b^{(2)} = \dots = W_b^{(h)}$; $W_c^{(1)} = W_c^{(2)} = \dots = W_c^{(h)}$; $W_d^{(1)} = W_d^{(2)} = \dots = W_d^{(h)}$; and so do all bias terms. The gate weights are also global: $W_g^{(1)} = W_g^{(2)} = \dots = W_g^{(h)}$.

4.3 Signed Distance-based Deep Memory Recommender (SDMR)

We now propose the Signed Distance-based Deep Memory Recommender (SDMR), a hybrid network that combines SDP and SDM. The first approach to combining them is to employ a weighted summation of the output scores from SDP and SDM:

$o = \beta\, o^{(SDP)} + (1 - \beta)\, o^{(SDM)}$  (20)

where $o^{(SDP)}$ is the signed distance score obtained in Eq. (6), $o^{(SDM)}$ is the signed distance score obtained in Eq. (18), and $\beta \in [0, 1]$ is a hyper-parameter to control the contributions of SDP and SDM. When $\beta = 0$, SDMR becomes SDM; when $\beta = 1$, SDMR becomes SDP. However, to avoid tuning an additional hyper-parameter $\beta$, we do not use Eq. (20) for SDMR. Instead, we let SDMR self-learn the combination of SDP and SDM as follows:

$o = \mathrm{ReLU}\big(w_u [e^{(l+1)}; e^{(h)}] + b_u\big)$  (21)

where $e^{(l+1)}$ is the final layer embedding from SDP obtained in Eq. (5), and $e^{(h)}$ is the final hop output from the multi-hop SDM obtained in Eq. (19). We note that SDP and SDM are first pre-trained separately using the BPR loss function (see the next section). Then we obtain $e^{(l+1)}$ from SDP and $e^{(h)}$ from SDM, and keep them fixed in Eq. (21) to learn $w_u$ and $b_u$. We use ReLU in Eq. (21) because ReLU encourages sparse activations and helps reduce over-fitting when combining the two components SDP and SDM.

4.4 Loss Functions

In our models, we adopt the Bayesian Personalized Ranking (BPR) optimization criterion as our loss function, which is similar in spirit to AUC (area under the curve):

$\mathcal{L} = \operatorname*{argmin}_{\theta} \sum_{(u, +, -)} -\log \sigma\big(o_{u-} - o_{u+}\big) + \lambda \|\theta\|^2$  (22)

where we uniformly sample tuples of the form (u, +, -) for user u with a positive item + (consumed) and a negative item - (unconsumed), $\lambda$ is a hyper-parameter to control the regularization term, and $\sigma(\cdot)$ is the sigmoid function. Note that other pairwise probability functions could be plugged into Eq. (22) to replace $\sigma(\cdot)$. Both SDP and SDM are end-to-end differentiable since we use soft attention over the output memory. Hence, we can utilize back-propagation to learn our models with stochastic gradient descent or Adam [18].
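A sketch of the BPR objective of Eq. (22) for signed-distance outputs. We assume, consistent with the distance interpretation, that a positive item should receive a smaller score than a sampled negative one; the sampling loop and optimizer step are omitted.

```python
import torch

def bpr_loss(o_pos, o_neg, params, lam=1e-3):
    """BPR loss of Eq. (22): maximize sigma(o_neg - o_pos) for signed
    distances (lower = preferred). params: tensors to L2-regularize;
    lam is an illustrative regularization weight.
    """
    loss = -torch.log(torch.sigmoid(o_neg - o_pos) + 1e-10).mean()
    reg = sum((p ** 2).sum() for p in params)
    return loss + lam * reg
```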
5 EMPIRICAL STUDY

In this section, we evaluate our proposed SDP, SDM, and SDMR models against ten state-of-the-art baselines in two recommendation tasks: (i) the general recommendation task, and (ii) the shopping basket-based recommendation task. We design our experimental evaluation to answer the following research questions (RQs):

RQ1: How do SDP, SDM, and SDMR perform compared to other state-of-the-art models in both the general recommendation task and the shopping basket-based recommendation task?

RQ2: Why/how does the multi-hop design help to improve the proposed models' performance?

5.1 Datasets

In each recommendation task, we conduct experiments on the following datasets.

General recommendation task: In this task, we evaluate our proposed models and the state-of-the-art methods using datasets with various density levels:

- MovieLens [40]: A widely adopted benchmark dataset for collaborative filtering evaluation. We use two versions of this benchmark dataset, namely MovieLens100k (ML-100k) and MovieLens1M (ML-1M).
- Netflix Prize: A real-world dataset collected by Netflix from 1999 to 2005, consisting of 463,435 users and 17,769 items with 56.9M interactions. Since the dataset is extremely large, we subsample it by randomly picking one month of data for evaluation.
- Epinions [34]: A well-known online rating dataset where users can share product feedback by giving explicit ratings and reviews.

In preprocessing, we adopted the popular k-core step [11, 25, 51] (with k-core = 5) to filter out inactive users with fewer than five ratings and items consumed by fewer than five users. Since ML-100k and ML-1M are already preprocessed, we only apply the 5-core preprocessing step to the Netflix and Epinions datasets. We also binarize the rating scores into implicit feedback by converting all observed rating scores into positive interactions and the remaining ones into negative interactions.
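A minimal pandas sketch of this 5-core filtering and binarization step; the DataFrame column names are illustrative assumptions rather than the paper's code.

```python
import pandas as pd

def k_core_binarize(df, k=5):
    """Iteratively drop users/items with fewer than k interactions, then
    binarize ratings into implicit feedback. df is assumed to have columns
    ['user', 'item', 'rating'].
    """
    while True:
        before = len(df)
        df = df[df.groupby("user")["item"].transform("size") >= k]
        df = df[df.groupby("item")["user"].transform("size") >= k]
        if len(df) == before:   # stop when the k-core is stable
            break
    return df.assign(label=1).drop(columns=["rating"])  # observed => positive
```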

The statistics of the four datasets are summarized in Table 1.

Table 1: Statistics of the four datasets in the general recommendation task.

Statistics        | ML-100k | ML-1M     | Netflix | Epinions
# of users        | 943     | 6,040     | 1,888   | 23,137
# of items        | 1,682   | 3,706     | 3,724   | 23,585
# of interactions | 100,000 | 1,000,209 |         |
Density (%)       | 6.3%    | 4.5%      | 1.5%    | 0.08%

Shopping basket-based recommendation task: We evaluate our proposed models on two real-world transactional datasets:

- IJCAI-15: A well-known shopping basket-based dataset consisting of shopping logs of users from Tmall. Since the original dataset is extremely large, we subsample IJCAI-15 by randomly picking 20k transactions for evaluation.
- TaFeng: A grocery store transaction dataset containing four months of transactions (November 2000 to February 2001) from the Ta-Feng supermarket.

In both the IJCAI-15 and TaFeng datasets, each user behavior is logged under four types of actions: click, add-to-cart, purchase, and add-to-favourite. We treat all four types as clicks. We only keep transactions with at least five items, because we take one item out for testing and another item for development; among the remaining three items, one is taken out as a target item and the other two are used as neighbor items, to which attentive scores are assigned. From each original transaction, we generate data instances of the format < c, v_c >, where v_c is the target/predicted item and c is the set of all other items in the same transaction as v_c. In particular, in each transaction t, we pick each item out in turn as a target item and leave the rest of the items in t as the corresponding neighbor/context items. Consequently, for each transaction t containing |t| items, we generate |t| data instances (see the sketch below). The statistics of the two transactional datasets are summarized in Table 2.

Table 2: Statistics of the two real-world transactional datasets in the shopping basket-based recommendation task.

Statistics                      | IJCAI-15 | TaFeng
# of users                      | 2,433    | 22,851
# of items                      | 4,534    | 22,291
avg # of items in a transaction |          |
# of generated instances        | 15,...   | ...,653
Density (%)                     | 0.14%    | 0.10%

For easy reference, we call (ML-100k, ML-1M, Netflix, Epinions) the Group-1 datasets and (IJCAI-15, TaFeng) the Group-2 datasets.
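The instance-generation step sketched below turns one transaction into |t| training instances; the function name and the list-based representation are our illustrative choices.

```python
def generate_instances(transaction):
    """From one transaction t, emit |t| instances <c, v_c>: each item in
    turn becomes the target v_c, and the remaining items form the context c.
    """
    return [(transaction[:i] + transaction[i + 1:], v)   # (context c, target v_c)
            for i, v in enumerate(transaction)]

# A 3-item basket yields 3 instances:
print(generate_instances(["milk", "bread", "eggs"]))
```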
5.2 Baselines and State-of-the-art Methods

We compared our proposed models against several strong baselines widely used in the general recommendation task:

- ItemKNN [41]: An item neighborhood-based collaborative filtering method. It exploits cosine item-item similarities to produce recommendation results.
- Bayesian Personalized Ranking (MF-BPR) [38]: A state-of-the-art pairwise matrix factorization method for implicit feedback datasets. It minimizes $\sum -\log \sigma(u^\top v_{j^+} - u^\top v_{j^-}) + \lambda(\|u\|^2 + \|v_{j^+}\|^2)$, where $(u, v_{j^+})$ is a positive interaction and $(u, v_{j^-})$ is a negative sample.
- Sparse LInear Method (SLIM) [35]: It learns a sparse item-item similarity matrix by minimizing the squared loss $\|A - AW\|^2 + \lambda_1 \|W\|_1 + \lambda_2 \|W\|^2$, where A is the m x n user-item interaction matrix and W is an n x n sparse matrix of aggregation coefficients of neighbor items.
- Collaborative Metric Learning (CML) [15]: A state-of-the-art collaborative metric-based model that utilizes a distance metric (i.e., Euclidean distance) to measure the similarities between users and items. For a fair comparison, we learn CML with the BPR loss by minimizing $\sum_{(i, j^+, j^-)} -\log \sigma\big(d(u_i, v_{j^-})^2 - d(u_i, v_{j^+})^2\big)$, where $d(u_i, v_{j^+})^2$ is the squared Euclidean distance of a positive interaction $(u_i, v_{j^+})$ and $d(u_i, v_{j^-})^2$ is the squared Euclidean distance of a negative sample $(u_i, v_{j^-})$.
- Neural Collaborative Filtering (NeuMF++) [12]: A state-of-the-art matrix factorization method using a deep learning architecture. We use a pre-trained NeuMF to achieve its best performance, and denote it as NeuMF++.
- Collaborative Memory Network (CMN++) [7]: A state-of-the-art memory network-based recommender. Its architecture follows traditional user neighborhood-based collaborative filtering approaches. It adopts a memory network to assign attentive weights to other similar users.

Even though our proposed methods do not model the order of consumed items in the user's purchase history (e.g., rigid orders of items), since we consider the latest s items as the context items to predict the next item, we still compare our models with some key sequential models to further show our models' effectiveness:

- Personalized Ranking Metric Embedding (PRME) [8]: Given a user u, a target item j, and a previously consumed item k, it models a personalized first-order Markov behavior with two components: $d_{ujk} = \alpha \|v_u - v_j\|^2 + (1 - \alpha) \|v_k - v_j\|^2$, where $\|v_u - v_j\|^2$ is the squared L2 distance of (u, j) and $\|v_k - v_j\|^2$ is the squared L2 distance of (k, j). PRME is then learned by minimizing the BPR loss.
- PRME_s: Our extension of PRME, where the distance between the target item j and the previously consumed item k is replaced by the average distance between j and each of the previous s items: $d_{ujs} = \alpha \|v_u - v_j\|^2 + (1 - \alpha) \frac{1}{s} \sum_{1 \le k \le s} \|v_k - v_j\|^2$. We use the BPR loss to learn PRME_s.
- Translation-based Recommendation (TransRec) [9]: It uses a first-order Markov model and considers a user u as a translator of his/her previously consumed item k to the next item j. In other words, $\mathrm{prob}(j \mid u, k) \propto \beta_j - d(u + v_k, v_j)$, where $\beta_j$ is an item bias term and d is a distance function (e.g., L1 or L2 distance). We use the L2 distance because it was shown to perform better than L1 [9]. TransRec is then learned with the BPR loss.
- Convolutional Sequence Embedding Recommendation (Caser) [48]: A state-of-the-art sequential model. It uses a convolutional neural network with many horizontal and vertical kernels to capture the complex relationships among items.

The strong sequential baselines above surpass many other sequential models: TransRec outperformed FMC [39], FPMC [39], and HRM [54]; Caser surpassed GRU4Rec [14] and Fossil [10]; so we exclude the latter models from our evaluation.

Comparison: In the general recommendation task, we compare our proposed models with all ten strong baselines listed above. In the shopping basket-based recommendation task, since the sequential models often work better than the general recommendation models (see Table 3), we only compare our proposed models with the sequential baselines (PRME, PRME_s, TransRec, and Caser). For easy reference, we call the general recommendation baselines (i.e., ItemKNN, MF-BPR, SLIM, CML, NeuMF++, CMN++) the Group-1 baselines, and the sequential baselines (i.e., PRME, PRME_s, TransRec, Caser) the Group-2 baselines.

5.3 Experimental Settings

Protocol: We adopt the widely used leave-one-out setting [12, 59], in which, for each user, we reserve her last interaction as the test sample. If no timestamps are available in the dataset, the test sample is randomly drawn. Among the remaining data, we randomly hold out one interaction per user to form the development set, while all others are used as the training set. Since it is very time-consuming and unnecessary to rank all the unobserved items for each user, we follow the standard strategy of randomly sampling 100 unobserved items for each user and ranking them together with the test item [12, 19].

Assigning item orders: Sequential models need rigid orders of consumed items, but consumed items in the same transaction (in the IJCAI-15 and TaFeng datasets) are assigned the same timestamp as the transaction containing them. Hence, we assigned item timestamps such that the orders of items are kept as in the original dataset. This may give credit to sequential models but not to our methods (because our methods use all consumed items in the same transaction as neighbor items and do not model item orders).

Hyper-parameter selection: We perform a grid search over the embedding size in {8, 16, 32, 64, 128} and regularization terms in {0.1, 0.01, 0.001, ...} for all models. We select the best number of hops for CMN++ and our SDM from {1, 2, 3, 4}. For NeuMF++, we select the best number of MLP layers from {1, 2, 3}. In our models, we fix the batch size to 256. We adopt the Adam optimizer [18] with a fixed learning rate. Similar to CMN++ and NeuMF++, the number of negative samples is set to 4. We use a one-layer perceptron for SDP (more complex datasets may need more than one layer for better results). We initialize the user and item embeddings using $\mathcal{N}(\mu = 0, \sigma = 0.01)$, and initialize the edge-weights layers using the LeCun uniform initializer (e.g., $w^{(o)}$, $w_e$, $w_u$ in Eqs. (6), (18), (21), respectively). In the four datasets used in the general recommendation task (i.e., ML-100k, ML-1M, Netflix, Epinions), to avoid too many zero paddings for users with few consumed items, or keeping too many neighbor items in memory (which unnecessarily slows down the model's execution), we follow [49] and limit the neighbor items to the latest s consumed items. We search for s in {5, 10, 20}. In the two shopping basket-based recommendation datasets (i.e., IJCAI-15 and TaFeng), since the maximum number of items in a transaction is small (13 in IJCAI-15 and 18 in TaFeng), we consider all the other items in the same transaction as the target item's neighbor items. All hyper-parameters are tuned on the development dataset.

Evaluation Metrics: We evaluate the performance of all compared models using two widely used metrics: Hit Ratio (hit@k) and Normalized Discounted Cumulative Gain (NDCG@k), where k is the truncation threshold of the top-k recommendation list. Intuitively, hit@k shows whether the test item is in the top-k list or not, while NDCG@k accounts for the positions of the hits by assigning higher scores to hits at top ranks and downgrading hits at lower ranks by a log2 discount.
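A sketch of this evaluation protocol; `rank_fn` is a hypothetical scoring hook standing in for a trained model that returns one signed distance per candidate item (lower = better).

```python
import math
import random

def hit_ndcg_at_k(rank_fn, test_item, unobserved, k=10, n_neg=100):
    """hit@k / NDCG@k for one user under leave-one-out: rank the held-out
    test item against n_neg randomly sampled unobserved items.
    """
    candidates = random.sample(list(unobserved), n_neg) + [test_item]
    scores = rank_fn(candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i])
    topk = [candidates[i] for i in order[:k]]
    if test_item not in topk:
        return 0.0, 0.0
    rank = topk.index(test_item)            # 0-based position of the hit
    return 1.0, 1.0 / math.log2(rank + 2)   # NDCG with one relevant item
```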
5.4 Experimental Results

RQ1: Overall results in the general recommendation task. The performance of our proposed models and the baselines is shown in Table 3. First, we observe that SDP significantly outperformed MF-BPR on both hit@10 and NDCG@10 in all four Group-1 datasets. Even though SDP and BPR share the same loss function, the difference between them is that SDP measures a signed distance score between a target user and a target item via an MLP, which models a non-linear interaction between them, while BPR relies on matrix factorization with an inner product. This result confirms the effectiveness of signed distance-based similarity over the inner product in the general recommendation task.

Second, we compare SDP with CML. CML works by minimizing the squared Euclidean distance between target users and target items. Our SDP, on the other hand, works by minimizing signed distance scores of a non-linear interaction (via non-linear activation functions) between target users and target items. We observe that SDP performed better than CML on hit@10 and NDCG@10 in all Group-1 datasets; on average, SDP improved hit@10 by 7.5% and NDCG@10 by 9.5% compared to CML. Our SDP even gained competitive results compared to NeuMF++ and CMN++: on average, SDP is only slightly worse than NeuMF++ and CMN++, by -2.67% for hit@10 and -1.68% for NDCG@10. All of these results show the effectiveness of using signed distances in our SDP model.

Next, we compare SDM with the neighborhood-based baselines. Both SLIM and ItemKNN use a user's previously consumed items to make the prediction for the next item. SDM significantly outperformed both baselines on hit@10 and NDCG@10. This is an expected result, because the neighborhood-based baselines merely measure linear similarities between the target item and the user's consumed items, whereas our SDM produces signed distance scores and assigns personalized metric-based attention weights to each consumed item that contributes to the target item.

We then compare SDM with CMN++ and NeuMF++. SDM outperformed CMN++ in all Group-1 datasets; on average, it improved hit@10 by 18.63% and NDCG@10 by 32.84% compared to CMN++. This result shows the effectiveness of our personalized metric-based attention with signed distance and the item-based neighborhood design over the traditional inner product-based attention in a user-based neighborhood design in CMN++. SDM also outperformed NeuMF++ on both metrics. On average, over all Group-1 datasets, SDM outperformed all the baselines in the General Recommenders group (Group 1), improving hit@10 by 18.13% and NDCG@10 by 32.58% compared to the best baseline in Group 1.

Table 3: General Recommendation Task: Overall performance of the baselines and our proposed SDP, SDM, and SDMR on four datasets (hit@10 and NDCG@10). Compared methods: Group 1 (General Recommenders): Item-KNN, SLIM, MF-BPR, CML, NeuMF++, CMN++; Group 2 (Sequential Recommenders): PRME, PRME_s, TransRec, Caser; Ours: SDP, SDM, SDMR. The last four rows show the relative improvement of SDM and SDMR over the best baseline in Group 1 and Group 2, respectively.

                            | ML-100k         | ML-1M           | Netflix         | Epinions
                            | hit@10  NDCG@10 | hit@10  NDCG@10 | hit@10  NDCG@10 | hit@10  NDCG@10
Imprv. of SDM over Group 1  | 14.54%  26.51%  | 11.93%  32.13%  | 11.71%  29.32%  | 34.35%  42.34%
Imprv. of SDMR over Group 1 | 11.65%  63.44%  | 11.11%  49.77%  | 13.24%  53.20%  | 32.71%  54.38%
Imprv. of SDM over Group 2  | 4.24%   8.21%   | -1.21%  -3.63%  | 8.35%   8.91%   | 4.36%   9.24%
Imprv. of SDMR over Group 2 | 1.61%   39.80%  | -1.94%  9.24%   | 9.83%   29.02%  | 3.09%   18.49%

Table 4: Shopping basket-based Recommendation Task: Overall performance of the baselines (PRME, PRME_s, TransRec, Caser) and our proposed models (SDP, SDM, SDMR) on two datasets. The last two rows show the relative improvement of SDM and SDMR over the best baseline.

               | IJCAI-15        | TaFeng
               | hit@10  NDCG@10 | hit@10  NDCG@10
Imprv. of SDM  | 14.49%  6.78%   | 3.86%   9.48%
Imprv. of SDMR | 21.74%  25.42%  | 0.80%   39.40%

Finally, we look at the performance of the SDMR model, which is the proposed fusion of SDP and SDM. Compared to SDM, SDMR degrades SDM's hit@10 by an insignificantly small amount, but it helps considerably in refining the ranking of items and boosting NDCG@10. As shown in Table 3, SDMR improved NDCG@10 by 17.37% on average compared to SDM in the Group-1 datasets. SDMR also surpassed all the methods in Group 1: on average, SDMR improved hit@10 by 17.18% and NDCG@10 by 55.20% compared to the best model in Group 1.

We also compare our models with the strong sequential models in Table 3. Sequential models exploit the consumption times of items and model their rigid orders, which often leads to much better performance compared to the general recommendation models among the Group-1 baselines. Even so, compared to the best sequential baseline, on average, SDM improved hit@10 by 3.94% and NDCG@10 by 5.68%, and SDMR improved hit@10 by 3.15% and NDCG@10 by 24.14%, as reported in Table 3.

RQ2: Understanding our multi-hop personalized metric-based attention design. In the previous section, we saw that our proposed models outperformed many strong baselines on six different datasets across the two recommendation problems. In this part, we explore why we achieved those better results. Echoing "attention is all you need" [53], the core reason for the improved performance is the metric-based attention, which is further refined via the multi-hop design. Therefore, we explore quantitatively and qualitatively how our attention with the multi-hop design works, by answering two smaller research questions: (i) what did our metric-based attention with multi-hop design learn? (ii) did the metric-based attention with multi-hop design improve recommendation results? Unless otherwise mentioned, since our SDMR model only learns a combination of SDP and SDM without re-learning the already-learned parameters in SDP and SDM, we use our SDM model in this section to understand how the attention with multi-hop design works. Note that we conduct this analysis for ML-100k only, due to the space limitation and the availability of movie genres in ML-100k (for the visualization in Figure 7).

What did our metric-based attention with multi-hop design learn?
To answer this research question, we first measure the point-wise mutual information (PMI) between two items j and k as:

$PMI(j, k) = \log \frac{P(j, k)}{P(j)\, P(k)}$  (23)

where $P(j, k)$ is the joint probability of the two items j and k, which shows how likely j and k are co-preferred ($P(j, k) = \frac{\#(j, k)}{|D|}$, where D denotes the collection of all item-item pairs and $|D|$ refers to the total number of item-item co-occurrence pairs in D). Similarly, $P(j)$ and $P(k)$ are the probabilities of items j and k appearing in D, respectively (i.e., $P(j) = \frac{\#(j)}{|D|}$, $P(k) = \frac{\#(k)}{|D|}$). Intuitively, the PMI score between two items shows how likely the two items are to be co-purchased/co-preferred: the higher the PMI score between j and k, the more likely a user will purchase j if k was purchased before.
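A small sketch of computing PMI scores per Eq. (23); estimating the counts from per-basket item pairs is our assumption about how the pair collection D is built.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(baskets):
    """PMI(j, k) = log( P(j,k) / (P(j) P(k)) ), Eq. (23), with all counts
    normalized by |D|, the total number of item-item co-occurrence pairs,
    following the definitions above.
    """
    pair_counts, item_counts = Counter(), Counter()
    for basket in baskets:
        for j, k in combinations(sorted(set(basket)), 2):
            pair_counts[(j, k)] += 1
            item_counts[j] += 1
            item_counts[k] += 1
    n = sum(pair_counts.values())   # |D|
    return {(j, k): math.log((c / n) / ((item_counts[j] / n) * (item_counts[k] / n)))
            for (j, k), c in pair_counts.items()}
```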

Figure 5: Comparison of varying the number of hops with respect to different embedding sizes {8, 16, 32, 64, 128} in the six datasets: (a) ML-100k, (b) ML-1M, (c) Netflix, (d) Epinions, (e) IJCAI-15, (f) TaFeng.

Figure 6: ML-100k: Scatter plots of PMI scores and attentive scores generated by SDM with h hops (h = {1, 2, 3, 4}), shown from left to right. The red lines are the linear trend lines. The Pearson correlation between the two scores increases as h increases.

We denote by SDM-h the SDM model with h hops. Now, given a target item j and a user's neighbor/context items k, SDM-h assigns attentive scores to all (j, k) pairs. We also compute the PMI scores (from Eq. (23)) of the (j, k) pairs. Next, we draw a scatter plot of PMI scores versus attentive scores for all (j, k) pairs to see the relationship between the two scores. Our results for the ML-100k dataset are shown in Figure 6. In Figure 6, the Pearson correlations between PMI scores and attentive scores are 0.059, 0.097, and 0.143 for SDM-1, SDM-2, and SDM-3, respectively, and higher still for SDM-4. This indicates that as we increase the number of hops in the SDM model, the PMI scores and attentive scores become more positively correlated. In other words, as we increase the number of hops, our metric-based attention with the multi-hop design assigns higher weights to co-purchased items, which is what we desire. Furthermore, the scatter plot in Figure 6a presents a high density of points with small attentive scores. This indicates that the attention in SDM-1 is distributed over several items (which is somewhat close to focusing equally on the neighbor items). However, when we increase the number of hops h, the density spreads up to the top, indicating that the model tends to give higher attention to some neighbor items, which can be more relevant than others. This observation is consistent with learning to attend in [2, 58].

Did the metric-based attention with multi-hop design improve recommendation results? We answer this research question by showing the results of the SDM model when varying the number of hops h in {1, 2, 3, 4} with different embedding sizes, and by visualizing the attention scores of SDM-h for a random observation.

Varying the number of hops with different embedding sizes: The performance of SDM-h in terms of hit@10, with h in {1, 2, 3, 4} and embedding sizes in {8, 16, 32, 64, 128}, is presented in Figure 5. We see that more hops tend to give additional improvement in all six datasets, except on TaFeng, where SDM with more hops over-fitted. In ML-100k and ML-1M, the optimal number of hops is 3 or 4. In Netflix, SDM with 3 hops performed well. In Epinions and IJCAI-15, SDM-4 tends to achieve better results.

Figure 7: Multi-hop attention visualization.
Overall, the selection of the number of hops depends on the dataset's complexity, and it varies from dataset to dataset.

Attention Visualization: Lastly, to visualize how the personalized metric-based attention with multi-hop design works, we chose one user from the ML-100k data. The learned weights at each hop of SDM are shown in Figure 7. The target item in this example is an action movie called Fire Down Below (1997). The first two hops of SDM assigned high weights to two romance movies and the lowest score to the action movie Money Talks (1997). The 3rd-hop and 4th-hop attention refined the weights of the movies to better reflect the correlations and similarities w.r.t. the target movie. In the end, Money Talks (1997) was assigned the highest weight (0.386), and the total weight of the two romance movies decreased to less than 0.2. This result shows the effectiveness of our multi-hop SDM model.

6 CONCLUSION

In this paper, we have studied the top-k recommendation problem from a distance metric learning perspective. Different from previous works, we have considered two independent signed distance models, for measuring user-item similarities and item-item similarities respectively, via deep neural networks. Extensive experiments have been performed on six real-world datasets in the general recommendation and shopping basket-based recommendation tasks. We showed that our proposed SDMR outperformed ten baselines in both recommendation tasks. As an extension, future work can integrate position embeddings [53] into our models, which would give them a sense of position and could further improve performance when rigid orders of items are available.


Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY SSDH: Sem-supervsed Deep Hashng for Large Scale Image Retreval Jan Zhang, and Yuxn Peng arxv:607.08477v2 [cs.cv] 8 Jun 207 Abstract Hashng

More information

Collaborative Filtering Ensemble for Ranking

Collaborative Filtering Ensemble for Ranking Collaboratve Flterng Ensemble for Rankng ABSTRACT Mchael Jahrer commo research & consultng 8580 Köflach, Austra mchael.ahrer@commo.at Ths paper provdes the soluton of the team commo on the Track2 dataset

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

CLiMF: Learning to Maximize Reciprocal Rank with Collaborative Less-is-More Filtering

CLiMF: Learning to Maximize Reciprocal Rank with Collaborative Less-is-More Filtering CLMF: Learnng to Maxmze Recprocal Rank wth Collaboratve Less-s-More Flterng Yue Sh Delft Unversty of Technology y.sh@tudelft.nl Martha Larson Delft Unversty of Technology m.a.larson@tudelft.nl Alexandros

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints TPL-ware Dsplacement-drven Detaled Placement Refnement wth Colorng Constrants Tao Ln Iowa State Unversty tln@astate.edu Chrs Chu Iowa State Unversty cnchu@astate.edu BSTRCT To mnmze the effect of process

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010 Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Local Quaternary Patterns and Feature Local Quaternary Patterns

Local Quaternary Patterns and Feature Local Quaternary Patterns Local Quaternary Patterns and Feature Local Quaternary Patterns Jayu Gu and Chengjun Lu The Department of Computer Scence, New Jersey Insttute of Technology, Newark, NJ 0102, USA Abstract - Ths paper presents

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Point-of-Interest Recommender Systems: A Separate-Space Perspective

Point-of-Interest Recommender Systems: A Separate-Space Perspective Pont-of-Interest Recommender Systems: A Separate-Space Perspectve Huayu L +, Rchang Hong, Sha Zhu and Yong Ge + + {hl38,yong.ge}@uncc.edu, hongrc@hfut.edu.cn, zsha@gmal.com + UC Charlotte, Hefe Unversty

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016)

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016) Technsche Unverstät München WSe 6/7 Insttut für Informatk Prof. Dr. Thomas Huckle Dpl.-Math. Benjamn Uekermann Parallel Numercs Exercse : Prevous Exam Questons Precondtonng & Iteratve Solvers (From 6)

More information

Fusion Performance Model for Distributed Tracking and Classification

Fusion Performance Model for Distributed Tracking and Classification Fuson Performance Model for Dstrbuted rackng and Classfcaton K.C. Chang and Yng Song Dept. of SEOR, School of I&E George Mason Unversty FAIRFAX, VA kchang@gmu.edu Martn Lggns Verdan Systems Dvson, Inc.

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

Abstract. 1 Introduction

Abstract. 1 Introduction Challenges and Solutons for Synthess of Knowledge Regardng Collaboratve Flterng Algorthms Danel Lowd, Olver Godde, Matthew McLaughln, Shuzhen Nong, Yun Wang, and Jonathan L. Herlocker. School of Electrcal

More information

Your Neighbors Affect Your Ratings: On Geographical Neighborhood Influence to Rating Prediction

Your Neighbors Affect Your Ratings: On Geographical Neighborhood Influence to Rating Prediction Your Neghbors Affect Your Ratngs: On Geographcal Neghborhood Influence to Ratng Predcton Longke Hu lhu003@e.ntu.edu.sg Axn Sun axsun@ntu.edu.sg Yong Lu luy0054@e.ntu.edu.sg School of Computer Engneerng,

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT 3. - 5. 5., Brno, Czech Republc, EU APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT Abstract Josef TOŠENOVSKÝ ) Lenka MONSPORTOVÁ ) Flp TOŠENOVSKÝ

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Generalized Team Draft Interleaving

Generalized Team Draft Interleaving Generalzed Team Draft Interleavng Eugene Khartonov,2, Crag Macdonald 2, Pavel Serdyukov, Iadh Ouns 2 Yandex, Russa 2 Unversty of Glasgow, UK {khartonov, pavser}@yandex-team.ru 2 {crag.macdonald, adh.ouns}@glasgow.ac.uk

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches Proceedngs of the Internatonal Conference on Cognton and Recognton Fuzzy Flterng Algorthms for Image Processng: Performance Evaluaton of Varous Approaches Rajoo Pandey and Umesh Ghanekar Department of

More information

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007 Syntheszer 1.0 A Varyng Coeffcent Meta Meta-Analytc nalytc Tool Employng Mcrosoft Excel 007.38.17.5 User s Gude Z. Krzan 009 Table of Contents 1. Introducton and Acknowledgments 3. Operatonal Functons

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Adaptive Transfer Learning

Adaptive Transfer Learning Adaptve Transfer Learnng Bn Cao, Snno Jaln Pan, Yu Zhang, Dt-Yan Yeung, Qang Yang Hong Kong Unversty of Scence and Technology Clear Water Bay, Kowloon, Hong Kong {caobn,snnopan,zhangyu,dyyeung,qyang}@cse.ust.hk

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Transformation Networks for Target-Oriented Sentiment Classification ACL / 25

Transformation Networks for Target-Oriented Sentiment Classification ACL / 25 Transformaton Networks for Target-Orented Sentment Classfcaton 1 Xn L 1, Ldong Bng 2, Wa Lam 1, Be Sh 1 1 The Chnese Unversty of Hong Kong 2 Tencent AI Lab ACL 2018 1 Jont work wth Tencent AI Lab Transformaton

More information

SAO: A Stream Index for Answering Linear Optimization Queries

SAO: A Stream Index for Answering Linear Optimization Queries SAO: A Stream Index for Answerng near Optmzaton Queres Gang uo Kun-ung Wu Phlp S. Yu IBM T.J. Watson Research Center {luog, klwu, psyu}@us.bm.com Abstract near optmzaton queres retreve the top-k tuples

More information

Topology Design using LS-TaSC Version 2 and LS-DYNA

Topology Design using LS-TaSC Version 2 and LS-DYNA Topology Desgn usng LS-TaSC Verson 2 and LS-DYNA Wllem Roux Lvermore Software Technology Corporaton, Lvermore, CA, USA Abstract Ths paper gves an overvew of LS-TaSC verson 2, a topology optmzaton tool

More information

High-Boost Mesh Filtering for 3-D Shape Enhancement

High-Boost Mesh Filtering for 3-D Shape Enhancement Hgh-Boost Mesh Flterng for 3-D Shape Enhancement Hrokazu Yagou Λ Alexander Belyaev y Damng We z Λ y z ; ; Shape Modelng Laboratory, Unversty of Azu, Azu-Wakamatsu 965-8580 Japan y Computer Graphcs Group,

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Discriminative Dictionary Learning with Pairwise Constraints

Discriminative Dictionary Learning with Pairwise Constraints Dscrmnatve Dctonary Learnng wth Parwse Constrants Humn Guo Zhuoln Jang LARRY S. DAVIS UNIVERSITY OF MARYLAND Nov. 6 th, Outlne Introducton/motvaton Dctonary Learnng Dscrmnatve Dctonary Learnng wth Parwse

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions

Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions Applcaton of Maxmum Entropy Markov Models on the Proten Secondary Structure Predctons Yohan Km Department of Chemstry and Bochemstry Unversty of Calforna, San Dego La Jolla, CA 92093 ykm@ucsd.edu Abstract

More information

Signature and Lexicon Pruning Techniques

Signature and Lexicon Pruning Techniques Sgnature and Lexcon Prunng Technques Srnvas Palla, Hansheng Le, Venu Govndaraju Centre for Unfed Bometrcs and Sensors Unversty at Buffalo {spalla2, hle, govnd}@cedar.buffalo.edu Abstract Handwrtten word

More information

Speeding Up the Xbox Recommender System Using a Euclidean Transformation for Inner-Product Spaces

Speeding Up the Xbox Recommender System Using a Euclidean Transformation for Inner-Product Spaces Speedng Up the Xbox Recommender System Usng a Eucldean Transformaton for Inner-Product Spaces Ran Glad-Bachrach Mcrosoft Research Yoram Bachrach Mcrosoft Research Nr Nce Mcrosoft R&D Lran Katzr Computer

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information