Collaborative Filtering Ensemble for Ranking


Michael Jahrer
commendo research & consulting
8580 Köflach, Austria

Andreas Töscher
commendo research & consulting
8580 Köflach, Austria
andreas.toescher@commendo.at

ABSTRACT

This paper provides the solution of the team commendo on the Track 2 dataset of the KDD Cup 2011 [3]. Yahoo! Labs provides a snapshot of their music-rating database as dataset for the competition, consisting of approximately 62 million ratings from 250k users on 300k items. The dataset includes hierarchical information about the items. The goal of the competition is to distinguish between High rated and Not rated items of a user. The rating scale is discrete and ranges from 0 to 100, while a High rating is a rating >= 80. The error measure is the percent of falsely rated tracks over all users, known as the fraction of misclassifications. The task is to minimize this error rate, hence the ranking should be optimized. Our final submission is a blend of different collaborative filtering algorithms, enhanced with basic statistics. The algorithms are trained consecutively and they are blended together with a neural network. Each of the algorithms optimizes a rank error measure.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications - Data Mining

General Terms

Application

Keywords

Collaborative Filtering, Ensemble Learning, KDD Cup, Rating Prediction, Ranking, Music Ratings

1. INTRODUCTION

In the KDD Cup 2011 challenge we work with the Yahoo! Music dataset [3]. The datasets of both tracks are disjoint, so we have different users on Track 2. The Track 2 dataset is significantly smaller than the one in Track 1.

1.1 Notation

We consider the dataset as a sparse matrix R = [r_ui], where we use the letter u for users and i for items throughout this writeup. Bold letters are used for matrices and vectors, non-bold letters for scalars. The set of items is denoted by I and |I| stands for the total number of items, |I| = 296,111. The set of users is written as U, while |U| is the total number of users, |U| = 249,012. Predictions for user/item pairs are denoted as r̂_ui. We have no information about the rating date. The set of items rated by user u is I(u). The set of users who rated item i is U(i). In addition to the rating data there is taxonomy information available. Taxonomy means the relationship between the items within the music domain. For example track x belongs to album y, and album y belongs to artist z. For a simpler and more holistic view we calculate item-to-parent and item-to-child tables. We use the notation P(i) for the parents of item i and C(i) for the children of item i. Parents and children are themselves items. The total number of children is 1,203,596 (same number of parents).

For Track 2 the goal is to minimize the error rate E(θ) on the test set. The letter θ denotes the trainable weights/parameters of the prediction model.

E(\theta) = \frac{100\%}{|R^{\mathrm{test}}|} \sum_{(u,i) \in R^{\mathrm{test}}} \begin{cases} 0, & r_{ui} \ge 80 \text{ and } \hat r_{ui}(\theta) = \mathrm{High} \\ 0, & \text{unrated and } \hat r_{ui}(\theta) = \mathrm{Low} \\ 1, & \text{else} \end{cases} \quad (1)

The error assessment is done as follows. For each user we should predict ratings for 6 items. This can be any numeric score; scaling does not matter. Then the evaluation machine sorts the scores and assigns the 3 highest scores to class High and the others to Low. Equation (1) shows how this is calculated. If we predict all item scores for a particular user correctly, it results in a 0% error rate. A 0% error implies that we separate the High rated items perfectly from all others.

Here we sketch the notation for model parameters. p_u is a user-dependent feature and q_i is an item-dependent feature. All features are stored in matrices, e.g. all the user features p_u are stored in the matrix P ∈ R^{F×|U|} and the columns are the F-dimensional features. The feature values are typically initialized with small random numbers and they are trained with stochastic gradient descent on the data r_ui. As learning rate we use the constant η, and λ for L2 regularization. In a basic matrix factorization model a prediction r̂_ui is calculated by the dot product of the user and the item feature, r̂_ui = p_u^T q_i.
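To make the evaluation protocol of equation (1) concrete, here is a minimal Python sketch of the per-user top-3 classification and the resulting error rate. The container names (user_scores, user_high) are illustrative assumptions, not part of the official evaluation code.

import random

# `user_scores` maps each test user to a list of (item, predicted_score)
# pairs (6 per user); `user_high` maps each user to the set of the 3 items
# that are truly rated High.
def error_rate(user_scores, user_high):
    wrong = 0
    total = 0
    for u, scored in user_scores.items():
        # Sort by predicted score; the 3 highest scores are classified High.
        ranked = sorted(scored, key=lambda t: t[1], reverse=True)
        predicted_high = {item for item, _ in ranked[:3]}
        for item, _ in scored:
            is_high = item in user_high[u]
            classified_high = item in predicted_high
            if is_high != classified_high:  # misclassified item
                wrong += 1
            total += 1
    return 100.0 * wrong / total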
1.2 The Dataset

The provided dataset for Track 2 [3] consists of 62,551,438 ratings from 249,012 users on 296,111 items. The range of the rating value is between 0 and 100. An item can be a track, an album, an artist or a genre. Most of the ratings are given to tracks (44%), followed by artists (31%), albums (19%) and genres (6%). The distribution of types within all items is given by 224,041 tracks, 52,829 albums, 18,674 artists and 567 genres. Considering the data as a sparse user × item matrix R = [r_ui], the fill rate of 0.08% is a very low value. For example, in the Netflix Prize dataset [1] the fill rate of the rating matrix is about 1%.

1.3 The Training of the Models

Our solution is a blended set of predictors. We train each predictor in two steps. In the first step we remove a validation set from the dataset and train the model. Then we store the predictions on the validation set. In the second step we train on all available data with the same meta parameters as in the first step, such as learning rate η, regularization λ, number of epochs etc., and store the predictions on the test set. The validation predictions from the first step are used to train the blender. The target of all models is the minimization of the error rate on the validation set.

1.4 Ranking and Sampling

In KDD Cup 2011 Track 2, the key idea for collaborative filtering is to introduce sampling. As mentioned in the introduction, the error measure is the fraction of misclassifications. The ratings are in the range from 0 to 100. In the competition, a High rating is defined as a rating >= 80. For a particular user we should distinguish between High rated items and unrated ones. In the test set, the task is to predict scores for 6 items per user. We know that 3 of these items are rated High; the others are Not rated by the user, but sampled with the Item-High probability. Item-High probability means that a drawn item occurs proportionally to its number of High ratings. It is given in equation (2), where |R_{\ge 80}| denotes the total number of High ratings in the dataset.

\mathrm{prob}_H(i) = \frac{1}{|R_{\ge 80}|} \sum_{u \in U(i)} \begin{cases} 1, & r_{ui} \ge 80 \\ 0, & \text{else} \end{cases} \quad (2)

In contrast to the High probability, we can sample an item according to its popularity. Equation (3) denotes this probability.

\mathrm{prob}(i) = \frac{|U(i)|}{|R|} \quad (3)
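A minimal sketch of the Item-High sampling of equation (2), assuming the ratings are available as (user, item, rating) triples; the helper names are illustrative.

import random
from collections import Counter

# An item is drawn proportionally to its number of High ratings (>= 80).
def item_high_distribution(ratings):
    high_counts = Counter(item for _, item, r in ratings if r >= 80)
    items = list(high_counts)
    weights = [high_counts[i] for i in items]
    return items, weights

def draw_unrated_item(items, weights, rated_by_user):
    # Rejection sampling: redraw until the item is not rated by the user.
    while True:
        i = random.choices(items, weights=weights, k=1)[0]
        if i not in rated_by_user:
            return i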
1.5 Drawing the Validation Set

The validation set of KDD Cup 2011 Track 1 is given as a separate text file. For Track 2 there is no predefined validation set available, therefore we generate one ourselves. The generation procedure is based on the public knowledge of the test set generation. We iterate over all users and pick those users who have at least 3 High ratings in the training set. For those users, 3 random High ratings and 3 unrated items drawn with the probability prob_H(i) are chosen. We end up with 109,360 users and 656,160 ratings in the validation set. For the final blending stage it is important to use the same validation set for every trained predictor. For all models in Section 2 we report error rate values on the validation set. In Table 1 we list three baseline error rates. A low error rate of a single predictor is not necessarily needed in order to blend well within the ensemble. An average user has 250 ratings, an average item 210 ratings.

predictor    | error rate
global mean  | 50.0%
item count   | 42.38%
user count   | 50.0%

Table 1: Baseline predictions. The error rate of global mean and user count is random (50%), because there is no difference in the predicted scores for a user. Item count is the number of ratings per item in the dataset and it performs a little better than random.

2. ALGORITHMS

Here we list the set of collaborative filtering models. All models are trained on the rating data directly. All the different models are listed here with their corresponding explanation. The formula for predicting one user/item pair is also added for each of the algorithms. For completeness we add the error rate value on the validation set of each of the models, with the used meta parameters. The listed numbers are taken from a predictor which is actually in our final blend.

2.1 Basic Taxonomy

The dataset contains taxonomy information. Items consist of tracks, albums, artists and genres. The information which track belongs to which album, the connection from the albums to the artists and from artists to genres was given. The idea of the basic taxonomy models is to produce predictors which blend well, by using taxonomy information which is not captured by the following matrix factorization models. In total we use 14 different basic taxonomy predictors (validation error rates in parentheses); a code sketch of one of them follows after equation (4).

1. r̃_ui = the number of users which gave a High rating to item i (42.42%)
2. r̃_ui = the number of users which rated item i (42.36%)
3. r̃_ui = the number of High ratings the user u has given to tracks included in the album of the given track (41.05%)
4. r̃_ui = the number of High ratings the user u has given to tracks belonging to the artist(s) of the given track (36.93%)
5. r̃_ui = the number of High ratings the user u has given to tracks belonging to the genre(s) of the given track (25.33%)
6. r̃_ui = the number of ratings the user u has given to tracks included in the album of the given track (38.50%)
7. r̃_ui = the number of ratings the user u has given to tracks belonging to the artist(s) of the given track (33.45%)
8. r̃_ui = the number of ratings the user u has given to tracks belonging to the genre(s) of the given track (25.23%)
9. r̃_ui = the number of High ratings the user u has given to a parent (album, artist, genre) of the given track (20.24%)
10. r̃_ui = the number of ratings the user u has given to a parent (album, artist, genre) of the given track (18.43%)
11. r̃_ui = the number of High ratings the user u has given to parents (album, artist, genre) and all children of those parents (distance 2) of track i (24.27%)
12. r̃_ui = the number of ratings the user u has given to parents (album, artist, genre) and all children of those parents (distance 2) of track i (25.55%)
13. r̃_ui = the number of High ratings to item i and parents of item i (47.29%)
14. r̃_ui = the number of ratings to item i and parents of item i (47.29%)

The final predictions are given by adding 1, in order to avoid a zero, and applying the logarithm, which leads to:

\hat r_{ui} = \log(\tilde r_{ui} + 1) \quad (4)
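As a concrete illustration, here is a minimal sketch of predictor 4 from the list above (High ratings the user gave to tracks of the same artist), combined with the log transform of equation (4). The taxonomy lookup artist_of is an assumed helper built from the item-to-parent tables.

import math
from collections import Counter

# Count the High ratings a user gave to tracks of each artist.
def artist_high_counts(ratings, artist_of):
    counts = Counter()
    for u, item, r in ratings:
        if r >= 80 and item in artist_of:
            counts[(u, artist_of[item])] += 1
    return counts

def predict(u, track, counts, artist_of):
    raw = counts.get((u, artist_of.get(track)), 0)
    return math.log(raw + 1)  # equation (4): add 1 to avoid log(0)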

2.2 Matrix Factorization Models

This section is about different matrix factorization techniques. The idea of matrix factorization is to project users and items to a low-dimensional space, so each user and each item is represented by a latent feature vector in R^F. Their dot product p_u^T q_i is the prediction.

2.2.1 SVD (regression)

The SVD model is a simple matrix factorization of the rating matrix. The training is based on regression with a quadratic error function and therefore it does not directly optimize the ranking score. Predictions of a user/item pair are given by the following formula:

\hat r_{ui} = p_u^T q_i \quad (5)

This is one of the most popular models in collaborative filtering since the Netflix Prize [1] in 2006. When using gradient descent as learning algorithm, training time grows linearly with the number of ratings |R|; prediction of one user/item pair can be done in constant time O(1), hence very fast. With F=500, η=0.0006, λ=1.5 and 20 iteration epochs we get an error rate of 15.0% on the validation set.

2.2.2 SVD

The SVD explained here is the same model as the SVD (regression) beforehand. The prediction is a dot product, given by

\hat r_{ui} = p_u^T q_i \quad (6)

The difference is how to do stochastic gradient descent while simultaneously optimizing the ranking error. Our first approach was to use zero-sampling; the idea is to train an SVD (regression) on the rating data and on some artificial zero ratings for randomly sampled items. This was better than the pure regression model, but not fully effective. However, after a few additional experiments we found an effective way to directly learn the ranking from the data. The idea is to learn the rating difference of two items for a given user u. The first item i carries the known rating, the other item i0 is drawn from all items I not rated by u. Hence we have two target values r_ui and r_ui0. The value of r_ui0 can be any constant; we set it for example to 0. The complete algorithm for training the SVD, which optimizes the ranking, is shown in Algorithm 1. The algorithm we use is very similar to the one in [7], which is ranking optimization using the ordinal regression loss. Again, we iterate over all elements of the sparse rating matrix R (line 3). The main difference to standard stochastic gradient descent is that the training is done on pairs, here item pairs (i, i0). The second item i0 (not rated by user u) is sampled according to the Item-High probability from all items, line 4 in Algorithm 1. The next steps are to predict the ratings r̂_ui and r̂_ui0 (lines 5 and 6). The target of the sampled item is set to 0 (line 7). The most important step is to learn the difference between the targets and the actual predictions; the error is shown in line 8. Updates of the features are done feature-wise (lines 10 to 13).

Input: sparse rating matrix R ∈ R^{|U|×|I|} = [r_ui]
Tunable: learning rate η, feature size F
 1: initialize user weights P ∈ R^{F×|U|} and item weights Q ∈ R^{F×|I|} with small random values
 2: while error on validation set decreases do
 3:   forall (u, i) ∈ R do
 4:     draw an item i0, not rated by u, from all items I according to the probability prob_H(i)
 5:     r̂_ui ← p_u^T q_i
 6:     r̂_ui0 ← p_u^T q_i0
 7:     r_ui0 ← 0
 8:     e ← (r̂_ui − r̂_ui0) − (r_ui − r_ui0)
 9:     for k = 1 ... F do
10:       c ← p_uk
11:       p_uk ← p_uk − η (e (q_ik − q_i0k))
12:       q_ik ← q_ik − η (e c)
13:       q_i0k ← q_i0k − η (e (−c))

Algorithm 1: Pseudo code for training an SVD for ranking on rating data.

With F=1000, η=…, and 25 iteration epochs we get an error rate of 5.84% on the validation set with the SVD model.
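A NumPy sketch of Algorithm 1 for one training epoch. The Item-High sampler is assumed to be supplied from outside (see Section 1.4), and the default learning rate is illustrative; everything not in the pseudocode above is an assumption.

import numpy as np

def train_epoch(ratings, P, Q, sample_unrated, eta=1e-4):
    # P: F x |U| user features, Q: F x |I| item features.
    for u, i, r_ui in ratings:
        i0 = sample_unrated(u)            # negative item, target 0
        p_u, q_i, q_i0 = P[:, u], Q[:, i], Q[:, i0]
        e = (p_u @ q_i - p_u @ q_i0) - (r_ui - 0.0)
        grad_p = e * (q_i - q_i0)         # lines 10-13 of Algorithm 1,
        grad_qi = e * p_u                 # done vector-wise instead of
        grad_qi0 = -e * p_u               # feature-wise (same updates)
        p_u -= eta * grad_p               # in-place: views into P and Q
        q_i -= eta * grad_qi
        q_i0 -= eta * grad_qi0

The gradients are computed from the old feature values before any update is applied, which reproduces the caching of c in line 10 of the pseudocode.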
2.2.3 AFM

This model is called asymmetric factor model because it uses only item-dependent parameters. It was first mentioned by Paterek in [5]. It learns a different kind of data variability compared to the simple SVD model, hence it blends well within the ensemble. The following equation is the prediction of a user/item pair. In a simplified view, the prediction is again a dot product of the item feature q_i and the user feature (the right side). But the user feature is given by the set of items I(u) that user u has rated; the user is therefore expressed purely by his rated items. The features learned from the data are combined in a different way compared to the pure SVD model, hence a linear combination of an SVD and an AFM mostly results in a more accurate model.

\hat r_{ui} = \mu_u + \mu_i + q_i^T \frac{1}{\sqrt{|I(u)|}} \sum_{j \in I(u)} q_j^{(0)} \quad (7)

The two matrices Q and Q^(0) both have the size |I| × F. The term 1/\sqrt{|I(u)|} is used for normalization purposes. User and item biases are μ_u and μ_i. Training the AFM with stochastic gradient descent is more effective when iterating over all users u ∈ U and making a batch update on the asymmetric part Q^(0). Algorithm 2 sketches the training process. In line 4 we initialize the update vector x ∈ R^F with zero. Line 16 sums the updates, and in line 25 the update is applied to the asymmetric features q^(0). The target value α of the sampled item i0 can be any number.

Input: sparse rating matrix R ∈ R^{|U|×|I|} = [r_ui]
Tunable: learning rate η, sampled target constant α, feature size F
 1: initialize item weights Q ∈ R^{F×|I|} and asymmetric item weights Q^(0) ∈ R^{F×|I|} with small random values
 2: while error on validation set decreases do
 3:   forall u ∈ U do
 4:     x ← 0
 5:     y ← (1/\sqrt{|I(u)|}) Σ_{j∈I(u)} q_j^(0)    (virtual user feature)
 6:     forall i ∈ I(u) do
 7:       draw an item i0, not rated by u, from all items I according to prob_H
 8:       r̂_ui ← μ_u + μ_i + y^T q_i
 9:       r̂_ui0 ← μ_u + μ_i0 + y^T q_i0
10:       r_ui0 ← α
11:       e ← (r̂_ui − r̂_ui0) − (r_ui − r_ui0)
12:       μ_u ← μ_u − η e
13:       μ_i ← μ_i − η e
14:       μ_i0 ← μ_i0 + η e
15:       for k = 1 ... F do
16:         x_k ← x_k + e (q_ik − q_i0k)
17:       end for
18:       for k = 1 ... F do
19:         q_ik ← q_ik − η (e y_k)
20:         q_i0k ← q_i0k − η (e (−y_k))
21:       end for
22:     end forall
23:     forall j ∈ I(u) do
24:       for k = 1 ... F do
25:         q_jk^(0) ← q_jk^(0) − η x_k

Algorithm 2: Pseudo code for gradient descent training of an AFM for ranking on rating data.

With F=1000, η=…, α=−50 and 30 iteration epochs we get an error rate of 6.64% on the validation set.

2.2.4 AFM flipped

Again the same asymmetric idea, but flipped to the item side. We have a fixed user feature, but the item is expressed by the set of users U(i) which rated this item. Thus the item feature is built from a normalized sum of rated-user features. User and item biases are μ_u and μ_i.

\hat r_{ui} = \mu_u + \mu_i + p_u^T \frac{1}{\sqrt{|U(i)|}} \sum_{v \in U(i)} p_v^{(0)} \quad (8)

The two matrices P and P^(0) both have the size |U| × F. We use the same scheme as in Algorithm 2, but flipped to the other side (training item-wise, etc.). With F=50, η=…, α=−50 and 50 iteration epochs we get an error rate of 7.050% on the validation set.

2.2.5 ASVD

The ASVD model combines the basic SVD approach with the idea from the AFM. The user is described by the user-dependent feature p_u and the normalized sum of rated items. This model is exactly the same as SVD++ from Y. Koren [4].

\hat r_{ui} = \mu_u + \mu_i + q_i^T \left( p_u + \frac{1}{\sqrt{|I(u)|}} \sum_{j \in I(u)} q_j^{(0)} \right) \quad (9)

The two matrices Q and Q^(0) both have the size |I| × F; matrix P has the size |U| × F. Stochastic gradient descent is applied in order to learn the features from the data. Training is similar to Algorithm 2, but the sampling of the second item i0 is done by picking a random item. With F=1000, η=…, α=0 and 35 iteration epochs we get an error rate of 5.949% on the validation set.

2.2.6 ASVD flipped

As in the section above, we apply the asymmetric idea from the ASVD model to the item side. Now the user is described by p_u, and we have at the item side (right brackets) the item-dependent feature q_i and the normalized sum of rated-user features.

\hat r_{ui} = \mu_u + \mu_i + p_u^T \left( q_i + \frac{1}{\sqrt{|U(i)|}} \sum_{v \in U(i)} p_v^{(0)} \right) \quad (10)

The two matrices P and P^(0) both have the size |U| × F; matrix Q has the size |I| × F. As in Algorithm 2 we use the same scheme, but flipped to the other side (training item-wise, etc.). With F=2000, η=…, α=−50 and 30 iteration epochs we get an error rate of 5.63% on the validation set.

2.2.7 ASVD flipped taxonomy

The ASVD flipped model can be further improved by using the taxonomy information. Items are categorized in 4 different types: tracks, albums, artists and genres. Hierarchical relationships between them are given by the taxonomy info, which comes with the dataset. From the taxonomy we build child and parent tables. C(i) is the set of related children of item i, P(i) is the set of parents of item i. In the model we extend the item part with two new features, the child features q^(1) and the parent features q^(2). We describe the item by its bag of children and by its bag of parents, in the same way as we did it in the asymmetric part q^(0). The prediction formula for a user/item pair is:

\hat r_{ui} = \mu_u + \mu_i + p_u^T \left( q_i + \frac{1}{\sqrt{|U(i)|}} \sum_{v \in U(i)} p_v^{(0)} + \frac{1}{\sqrt{|C(i)|}} \sum_{j \in C(i)} q_j^{(1)} + \frac{1}{\sqrt{|P(i)|}} \sum_{j \in P(i)} q_j^{(2)} \right) \quad (11)

The matrices Q, Q^(1), Q^(2) have the size |I| × F; the matrices P and P^(0) have the size |U| × F. The parameters are learned according to Algorithm 2, but flipped to the other side (training item-wise, etc.). With F=3000, η=…, α=−50 and 40 iteration epochs we get an error rate of 4.178% on the validation set. This is our most accurate single model.
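A sketch of the extended item representation of equation (11), assuming the per-item sets U(i), C(i), P(i) are given as index lists into the respective feature matrices; the shapes and helper names are illustrative.

import numpy as np

def bag(features, index_set):
    # features: F x n matrix; normalized sum of the selected columns.
    if not index_set:
        return np.zeros(features.shape[0])
    return features[:, index_set].sum(axis=1) / np.sqrt(len(index_set))

def predict(u, i, mu_u, mu_i, P, Q, P0, Q1, Q2, U_i, C_i, P_i):
    # Item part: own feature + bag of raters + bag of children + bag of parents.
    item_part = Q[:, i] + bag(P0, U_i) + bag(Q1, C_i) + bag(Q2, P_i)
    return mu_u[u] + mu_i[i] + P[:, u] @ item_part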
2.2.8 ASVD flipped rated taxonomy

Taxonomy information was very helpful in the ASVD flipped taxonomy model; furthermore, we introduce a new parameter q^(3) which learns structure from rated parents.

The set P(i) is the set of all parents; now we additionally use the set P_u(i) of parents rated by user u.

\hat r_{ui} = \mu_u + \mu_i + p_u^T \left( q_i + \frac{1}{\sqrt{|U(i)|}} \sum_{v \in U(i)} p_v^{(0)} + \frac{1}{\sqrt{|C(i)|}} \sum_{j \in C(i)} q_j^{(1)} + \frac{1}{\sqrt{|P(i)|}} \sum_{j \in P(i)} q_j^{(2)} + \frac{1}{\sqrt{|P_u(i)|}} \sum_{j \in P_u(i)} q_j^{(3)} \right) \quad (12)

With F=1000, η=…, α=−63 and 45 iteration epochs we get an error rate of 4.277% on the validation set.

2.3 Neighborhood Models

2.3.1 Item-Item KNN

In order to predict the rating r̂_ui, a k-nearest-neighbor algorithm selects the K ratings with the highest correlations to item i. Therefore we introduce the set I(u,i,K), which consists of the rated items of user u with the K highest correlations to item i, I(u,i,K) ⊆ I(u). The final prediction formula of the Item-Item KNN model is:

\hat r_{ui} = \frac{\sum_{j \in I(u,i,K)} r_{uj} \, \hat c_{ij}}{\sum_{j \in I(u,i,K)} \hat c_{ij}} \quad (13)

Our item-item neighborhood model uses a sigmoid function to map the correlations c_ij to ĉ_ij by introducing two new parameters, σ (scale) and γ (offset):

\hat c_{ij} = 0.5 \tanh(\sigma c_{ij} + \gamma) + 0.5 \quad (14)

In general we use the Pearson correlation between two items; the model is then called Item-Item KNN Pearson. Another, more effective way of computing the correlation between two items is to use the feature distance. The equation below denotes the correlation c_ij between two items i and j; we dub this model Item-Item KNN SVD feat.

c_{ij} = \frac{\sum_{k=1}^{F} (q_{ik} - q_{jk})^2}{\sqrt{\sum_{k=1}^{F} q_{ik}^2} \sqrt{\sum_{k=1}^{F} q_{jk}^2}} \quad (15)

Two models of the Item-Item KNN with SVD features are in the final blend. For those, we calculate the KNN on the residuals of the SVD. We get good results: using an SVD with F=100 and σ=0.62, γ=−2.6, K=27, the error rate on the validation set is 5.66%.

2.3.2 User-User KNN

The User-User KNN is the flipped version of the Item-Item KNN; the only difference is that users and items are exchanged. Since we have approximately the same number of items and users, the User-User KNN has a similar computational complexity to the item one. The only model which is in the blend is based on Pearson correlations between users, hence the model is called User-User KNN Pearson. With meta parameters σ=23.3, γ=−4.1 and K=507 the error rate on the validation set is 8.35%.
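A sketch of the neighborhood prediction of equations (13) and (14); the correlation function corr is assumed to be precomputed (Pearson or SVD-feature based), and the names are illustrative.

import numpy as np

def c_hat(c, sigma, gamma):
    return 0.5 * np.tanh(sigma * c + gamma) + 0.5   # equation (14)

def predict(u, i, rated_items, ratings_of_u, corr, K, sigma, gamma):
    # K rated items of user u with the highest correlation to item i.
    neigh = sorted(rated_items, key=lambda j: corr(i, j), reverse=True)[:K]
    weights = np.array([c_hat(corr(i, j), sigma, gamma) for j in neigh])
    r = np.array([ratings_of_u[j] for j in neigh])
    return (r * weights).sum() / weights.sum()      # equation (13)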
2.4 Restricted Boltzmann Machines

The use of Conditional Restricted Boltzmann Machines for collaborative filtering is described in [6]. We use a very similar setup, but adapted the methodology to optimize the given ranking error measure. The set of items rated by user u is denoted as I(u). This leads to the set of unrated items I \ I(u) for user u. For every user we choose a subset of the unrated items, Ĩ(u) ⊆ I \ I(u), with the same size as the set of rated items, |Ĩ(u)| = |I(u)|. One important thing is to draw the items of Ĩ(u) from the set of unrated items according to the probability of an item being rated High. Every item i is represented via two binary visible units v_i^(1) and v_i^(2). The unit v_i^(2) is 1 if the user rated the item, i ∈ I(u); otherwise it is 0. The unit v_i^(1) represents the unrated items and is 1 if i ∈ Ĩ(u), otherwise it is 0. For the conditioning of the hidden states we use the items rated with a High rating; this set of items is denoted as Î(u) and is a subset Î(u) ⊆ I(u) of all items rated by user u. The used RBM topology can be seen in Figure 1. The probabilities of the visible and hidden units being 1 are given by:

p(v_i^{(l)} = 1 \mid h) = \frac{\exp(a_i^{(l)} + \sum_j h_j w_{ij}^{(l)})}{\sum_{l'=1}^{2} \exp(a_i^{(l')} + \sum_j h_j w_{ij}^{(l')})} \quad (16)

p(h_j = 1 \mid v) = \sigma\left( b_j + \sum_{i \in I(u) \cup \tilde I(u)} \sum_{l=1}^{2} w_{ij}^{(l)} v_i^{(l)} + \sum_{i \in \hat I(u)} d_{ij} \right) \quad (17)

\sigma(x) = \frac{1}{1 + e^{-x}} \quad (18)

We initialize the biases of the visible units a_i^(l) to the logarithm of the probability of the corresponding unit being 1. The biases of the hidden units b_j are initialized to zero. The weights w and d are initialized to values drawn from a zero-mean Gaussian distribution with a small standard deviation. With 300 hidden units we achieve an error rate of 10.1% on the validation set.

For every training case we activate only a small set of visible units: the set of rated items I(u) and the sampled set of unrated items Ĩ(u). Every visible unit belonging to an item not included in the set Ī(u) = Ĩ(u) ∪ I(u) must not be reconstructed during the training on user u. Both sets have the same size, |Ĩ(u)| = |I(u)|, and every item is represented via 2 visible units. Hence there is the need to reconstruct 2|Ī(u)| visible units, which is less than the total number of items |I|. It would also be possible to use a single visible unit for every item, being 1 for rated and 0 for unrated. The problem is that this setup would require a reconstruction of every visible unit, which equals |I| reconstructed visible units for every training user u. On the given dataset, where every user only rated a small fraction of all items, this increases the computational costs. By using two visible units per item it is not necessary to reconstruct all items; only the rated items and a subset of unrated items have to be reconstructed.
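A sketch of the conditional hidden-unit activation of equation (17); the weight layout (one matrix per visible-unit type plus the conditioning weights D, each of shape n_items x n_hidden) is an illustrative assumption.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))             # equation (18)

def hidden_probs(b, W1, W2, D, sampled_unrated, rated, rated_high):
    # v(1) units are 1 for the sampled unrated items Ĩ(u),
    # v(2) units are 1 for the rated items I(u); only those contribute.
    act = b.copy()
    act += W1[list(sampled_unrated)].sum(axis=0)
    act += W2[list(rated)].sum(axis=0)
    act += D[list(rated_high)].sum(axis=0)      # conditioning on High ratings
    return sigmoid(act)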

Figure 1: A Conditional Restricted Boltzmann Machine. The hidden units are conditioned on the items rated High by the given user.

3. BLENDING

For Track 2 there was no predefined holdout set. In Section 1.5 we describe how we chose our fixed holdout set. This holdout set is designed to resemble the disclosed test set and was fixed for all trained models. In order to produce one predictor, the holdout set is removed from the data. The algorithm trains the model on the remaining data; then predictions for the holdout set are produced. Afterwards we insert the holdout set into the training set, which results in the original training set. Now the model is trained again on the original training set and produces predictions for the test set. Hence every predictor p_n = [p_n^(1), p_n^(2)] consists of predictions on the holdout set, p_n^(1), and of predictions for the test set, p_n^(2). For the holdout set the true ranking is known; this leads to a supervised learning problem.

We use a neural network with a single hidden layer of sigmoid units and a linear readout. The network structure can be seen in Figure 2. The network is trained using pairwise rankings; the idea for this methodology originates in [2]. The output is given by:

o = \sum_k w_k^{(2)} h_k, \qquad h_k = \tanh\left( \sum_n w_{kn}^{(1)} p_n \right) \quad (19)

o = \sum_k w_k^{(2)} \tanh\left( \sum_n w_{kn}^{(1)} p_n \right) \quad (20)

The difference between two outputs o_i and o_j is denoted as

\Delta o_{ij} = \zeta (o_i - o_j) \quad (21)

where ζ is a meta parameter for scaling. In order to train the weights we consider pairwise rankings. We choose two indices i and j, where index i is ranked higher than j. The cost function

C_{ij} = -\Delta o_{ij} + \log(1 + e^{\Delta o_{ij}}) \quad (22)

is described in [2]. It is low if the outputs of the network predict the correct order, meaning that o_i is higher than o_j. The derivatives of the pairwise error between the training samples i and j, where i should be ranked higher than j, are given by:

\frac{\partial C_{ij}}{\partial w_k^{(2)}} = \zeta \left( h_k^{(i)} - h_k^{(j)} \right) \left( \frac{e^{\Delta o_{ij}}}{1 + e^{\Delta o_{ij}}} - 1 \right) \quad (23)

\frac{\partial C_{ij}}{\partial w_{kn}^{(1)}} = \zeta w_k^{(2)} \left( (1 - (h_k^{(i)})^2) p_n^{(i)} - (1 - (h_k^{(j)})^2) p_n^{(j)} \right) \left( \frac{e^{\Delta o_{ij}}}{1 + e^{\Delta o_{ij}}} - 1 \right) \quad (24)

Figure 2: The structure of the neural network used for blending. The N inputs p_n are normalized to mean zero and standard deviation 1. The last input is constantly 1. The hidden layer consists of K sigmoid units, while the readout is purely linear.

The training is done using stochastic gradient descent and 32-fold cross validation. The weights are initialized to random values drawn uniformly between −0.4 and 0.4. The parameter ζ is set in the range from 1 to 2; for the final blends we used ζ=1.69. The number of hidden units was set to 30. Table 2 contains all predictors used in the final blend, which produced an error of …% on the leaderboard.
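A sketch of the pairwise blending cost of equations (21) and (22) and the output-weight gradient of equation (23); the network shapes and names are illustrative, and the constant bias input of Figure 2 is assumed to be appended to p beforehand.

import numpy as np

def forward(p, W1, w2):
    # p: N normalized predictor inputs (last entry constantly 1, Figure 2),
    # W1: K x N hidden weights, w2: K linear readout weights.
    h = np.tanh(W1 @ p)                               # equation (19)
    return w2 @ h, h

def pair_cost_and_grad_w2(p_i, p_j, W1, w2, zeta):
    # Sample i should be ranked higher than sample j.
    o_i, h_i = forward(p_i, W1, w2)
    o_j, h_j = forward(p_j, W1, w2)
    delta = zeta * (o_i - o_j)                        # equation (21)
    cost = -delta + np.log1p(np.exp(delta))           # equation (22)
    coeff = np.exp(delta) / (1.0 + np.exp(delta)) - 1.0
    grad_w2 = zeta * (h_i - h_j) * coeff              # equation (23)
    return cost, grad_w2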
4. GENERAL THOUGHTS

Sampling is the key idea when optimizing a rank error measure within a collaborative filtering algorithm; compared to pure stochastic gradient descent it works much better. We found out that regularization is not that important, hence we skip it for most of the models. During the competition our goal was to try many different models which can explain the variability of the data. The accuracy of matrix factorization models can be easily improved by adding more features. We observe a much higher relative improvement in the error rate, compared to RMSE, when doubling the number of features; hence some of our best models work with over 1000 features. We did experiments with sampling items of the same type, e.g. if a track is processed then also a track is sampled, but the results only got worse. In general, taxonomy information helps in getting an accurate collaborative filtering model. Our opinion is that the rating date would also help in further improving the model; our approach on Track 1 shows the effectiveness of integrating the rating date in the models.

5. REFERENCES

[1] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of KDD Cup and Workshop in conjunction with KDD, 2007.
[2] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 89-96, New York, NY, USA, 2005. ACM.
[3] G. Dror, Y. Koren, and M. Weimer. Yahoo! Music Rating Dataset for the KDD Cup 2011 Track 2.
[4] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426-434. ACM, 2008.
[5] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, 2007.
[6] R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, pages 791-798, 2007.
[7] M. Weimer, A. Karatzoglou, and A. J. Smola. Improving maximum margin matrix factorization. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of LNAI. Springer, 2008.

no | model                   | error rate valid | meta parameters
 1 | AFM                     | 5.877%  | F=3000, epochs=30, η=…, α=−50
 2 | AFM                     | 6.64%   | F=1000, epochs=30, η=…, α=−50
 3 | AFM flipped             | 7.050%  | F=50, epochs=50, η=…, α=−50
 4 | ASVD flipped            | 5.5%    | F=200, epochs=30, η=…, α=−50
 5 | ASVD flipped taxo       | 4.178%  | F=3000, epochs=40, η=…, α=−50
 6 | ASVD flipped rated taxo | 4.65%   | F=200, epochs=50, η=…, α=−50
 7 | ASVD flipped rated taxo | 4.258%  | F=1000, epochs=45, η=…, α=−50
 8 | ASVD flipped rated taxo | 4.280%  | F=3000, epochs=45, η=…, α=−63
 9 | ASVD flipped rated taxo | 4.277%  | F=1000, epochs=45, η=…, α=−63
10 | ASVD flipped rated taxo | 5.96%   | F=100, epochs=40, η=…, α=−63
11 | ASVD flipped rated taxo | 7.62%   | F=30, epochs=20, η=…, α=−63
12 | ASVD flipped rated taxo | 7.263%  | F=50, epochs=30, η=…, α=−50, same type in sampling
13 | ASVD                    | 6.00%   | F=1000, epochs=5, η=…, α=0
14 | ASVD                    | 5.949%  | F=1000, epochs=35, η=…, α=0
15 | Basic Taxonomy (1-14)   | 18%-47% | details in Section 2.1
16 | RBM discrete            | 28.8%   | nhid=10, η=0.0001, λ=…
17 | RBM discrete            | 28.29%  | nhid=10, η=0.0001, λ=…
18 | RBM discrete            | 12.92%  | nhid=100, η=…, λ=…
19 | RBM discrete            | 10.1%   | nhid=300, η=0.005, λ=…
20 | Item-Item KNN Pearson   | 6.03%   | σ=0.5, γ=−8.2, K=29
21 | Item-Item KNN Pearson   | 3.68%   | σ=9.2, γ=−3.2, K=…
22 | Item-Item KNN SVD feat  | 5.66%   | σ=0.62, γ=−2.6, K=27 (SVD: F=100, η=0.0001, epochs=30)
23 | Item-Item KNN SVD feat  | 5.78%   | σ=1.5, γ=−1.66, K=67 (SVD: F=100, η=…, epochs=50)
24 | User-User KNN Pearson   | 8.35%   | σ=23.3, γ=−4.1, K=507
25 | SVD (regression)        | 15.50%  | F=100, epochs=20, η=…, λ=1.5
26 | SVD (regression)        | 15.0%   | F=500, epochs=20, η=0.0006, λ=1.5
27 | SVD                     | 6.88%   | F=100, epochs=30, η=…
28 | SVD                     | 5.84%   | F=1000, epochs=25, η=…

Table 2: Predictors in the final blend. The neural-network blender achieves an error rate of 2.67% on the validation set and an error rate of …% on the KDD Cup 2011 Track 2 leaderboard.


More information

Learning physical Models of Robots

Learning physical Models of Robots Learnng physcal Models of Robots Jochen Mück Technsche Unverstät Darmstadt jochen.mueck@googlemal.com Abstract In robotcs good physcal models are needed to provde approprate moton control for dfferent

More information

Overview. Basic Setup [9] Motivation and Tasks. Modularization 2008/2/20 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION

Overview. Basic Setup [9] Motivation and Tasks. Modularization 2008/2/20 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION Overvew 2 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION Introducton Mult- Smulator MASIM Theoretcal Work and Smulaton Results Concluson Jay Wagenpfel, Adran Trachte Motvaton and Tasks Basc Setup

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010 Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

LOCALIZING USERS AND ITEMS FROM PAIRED COMPARISONS. Matthew R. O Shaughnessy and Mark A. Davenport

LOCALIZING USERS AND ITEMS FROM PAIRED COMPARISONS. Matthew R. O Shaughnessy and Mark A. Davenport 2016 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 13 16, 2016, SALERNO, ITALY LOCALIZING USERS AND ITEMS FROM PAIRED COMPARISONS Matthew R. O Shaughnessy and Mark A. Davenport

More information

Speeding Up the Xbox Recommender System Using a Euclidean Transformation for Inner-Product Spaces

Speeding Up the Xbox Recommender System Using a Euclidean Transformation for Inner-Product Spaces Speedng Up the Xbox Recommender System Usng a Eucldean Transformaton for Inner-Product Spaces Ran Glad-Bachrach Mcrosoft Research Yoram Bachrach Mcrosoft Research Nr Nce Mcrosoft R&D Lran Katzr Computer

More information

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros.

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros. Fttng & Matchng Lecture 4 Prof. Bregler Sldes from: S. Lazebnk, S. Setz, M. Pollefeys, A. Effros. How do we buld panorama? We need to match (algn) mages Matchng wth Features Detect feature ponts n both

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information