Challenges and Solutions for Synthesis of Knowledge Regarding Collaborative Filtering Algorithms

Daniel Lowd, Oliver Godde, Matthew McLaughlin, Shuzhen Nong, Yun Wang, and Jonathan L. Herlocker
School of Electrical Engineering and Computer Science, Oregon State University
102 Dearborn Hall, Corvallis, OR
{,, mclaughm, nong, wangyun,

Abstract

Collaborative filtering (CF)-based recommender systems predict what items a user will like or find useful based on the recommendations (active or implicit) of other members of a networked community. In spite of more than ten years of research, there is little consensus on state-of-the-art knowledge regarding CF predictive algorithms. There are many barriers to synthesis of the significant quantity of available published research on CF algorithms. We present results from an empirical study that attempts synthesis on popular CF algorithms and use this study to illustrate some key challenges to synthesis in CF algorithm research. In response to these challenges we propose the development of publicly maintained reference implementations of proposed CF algorithms and empirical evaluation procedures, and we introduce CoFE, a public software framework with the goal of jumpstarting the building of these reference implementations. Finally, we demonstrate how CoFE was used to implement a high-performance nearest-neighbor-based algorithm that scales to arbitrary numbers of users.

1 Introduction

"Read any good books lately?" Every day, people ask each other questions such as this in an attempt to sort through the plethora of options that both enrich and afflict modern living. In a world with vastly more books available than any of us has time to look at, much less read, we all need some way to decide among them. By passing along recommendations to friends with similar taste, people distribute the work of finding good books in order to spend more time reading books they enjoy. Unfortunately, not all of our friends share our tastes, limiting the number of useful counselors available. Furthermore, those friends who do share our tastes probably haven't read every book we might like, and can only furnish recommendations for the few they know.

Of course, the problem is more general than finding good books: people must make decisions about a great many things, including movies, restaurants, web sites, house plants, tropical resorts, and so on. How can we get better recommendations on such an ever-widening assortment of options?

Author note: Daniel Lowd is now at the Department of Computer Science and Engineering, University of Washington, Seattle, WA, 98195, lowd@cs.washington.edu. Oliver Godde can be reached at ogodde@yahoo.com.

1.1 Brief Introduction to Collaborative Filtering & Recommender Systems

Collaborative filtering-based recommender systems address precisely this problem by drawing upon the experiences of thousands or even millions of people. For example, a book recommendation web site using collaborative filtering technology could combine the ratings of millions of online users to give better and broader book recommendations than any of the users could get from friends. The work of finding good books is thus distributed, as in a circle of friends, but on a much larger scale. Amazon.com is one well-recognized example of a site that uses collaborative filtering in this manner.

In addition to helping users find items of interest, collaborative filtering has proven to benefit e-commerce retailers as well. Amazon.com reported that many more sales result from items recommended by collaborative filtering than from those shown on bestseller or featured items lists [15]. Another success story found that collaborative-filtering-based recommendation e-mails generated twice as many purchases as manual recommendations [27].

These systems are just two of the many recommender systems, commercial and academic, that have been developed using collaborative filtering. Other systems have included MovieLens and NetFlix.com for recommending movies, PHOAKS for recommending websites, Ringo for recommending music, Jester for recommending jokes, and many more [7,9,25,26].

1.2 Important Terminology Used in this Paper

Recommender systems are a relatively new area of study and standardized vocabulary is only beginning to emerge. Here we briefly provide our definitions for the terminology used in this paper.

Resnick et al. describe a recommender system as follows: "In a typical recommender system people provide recommendations as inputs which the system then aggregates and directs to appropriate recipients." [20] The most common technology used to implement recommender systems is collaborative filtering, which we may refer to as CF for short. Recommender systems may incorporate non-CF technology.
However, in this article, we focus exclusively on CF technology.

We refer to anything recommended by a recommender system as an item. This term could refer not only to books and movies, but also to restaurants, web pages, New Year's resolutions, and so on. The type of items being recommended and the context in which they are recommended is known as the content domain of recommendation (e.g. web-based book recommendation domain).

A user is an individual who interacts with a recommender system, providing the system with ratings in order to receive recommendations or predictions.

Ratings are statements of preference by users for items. At a minimum, a rating consists of three elements: a user, an item, and a rating value. The rating value may be binary, integer-valued, real-valued, or even unary (a single positive rating value, but no negative or ambivalent rating values). For integer- and real-valued ratings, low numbers generally indicate negative preference (the item was bad), middle numbers indicate ambivalence (the item was neither good nor bad), and high numbers indicate positive preference (the item was great!).

A prediction or predicted rating is a recommender system's estimate of the rating value that a user would assign to an item. A recommendation is an item with a high predicted rating for a user that is recommended to that user. Recommendations are often called best bets.

A collaborative filtering algorithm is a procedure that examines ratings data from users, infers preference patterns among many users and many items, and computes a predicted rating for a given user (termed the active user) on a given item (termed the active item). A ranked list of recommended items can be generated as well, by predicting ratings for all items and listing the N items whose predicted ratings are the highest (see footnote 1). The accuracy of a CF algorithm is defined as how close an algorithm's predicted ratings are to the true ratings supplied by users.

Explicit ratings are evaluations that are directly entered by users; for example, a user's rating of 1-5 stars for an item at Amazon.com is an explicit rating. Implicit ratings, in contrast, are indications of preference that are derived from other user behavior, such as purchasing certain items from a catalog or visiting specific web sites. Algorithms that work with implicit user ratings generally operate differently to take into account a very different quality of information. For this study, we chose to focus solely on explicit ratings and algorithms designed to operate on them.

A dataset is a collection of preference ratings data from a community of users on a set of items in a particular target domain. The most popular publicly available datasets involve ratings for movies and videos: the EachMovie dataset [16], and the MovieLens dataset [1].

A metric is a computation applied to the output of a collaborative filtering algorithm to provide an evaluation of the quality of the collaborative filtering algorithm.

1.3 The Challenge of Finding the Best CF Algorithm

There have been many different collaborative filtering algorithms proposed to compute predictions of users' ratings. One algorithm will be more effective than the others, given specific circumstances.
Given a choice, you would always want to use the most effective algorithm possible, since that ought to result in a better user experience, more e-commerce sales, less time wasted browsing through irrelevant items, and so on. For just this reason, a more effective algorithm has become the most popular research goal among collaborative filtering researchers in recent years. How one best determines if an algorithm is more effective is still open to debate, but most researchers use an algorithm's average accuracy when tested on an existing database of user ratings. Others may look at execution time and memory requirements as well.

So what are the best CF algorithms? With almost ten years of published scientific research on the development and evaluation of CF algorithms, we would expect to have solid recorded knowledge about which algorithms are best for which content domains. An examination of the published research indicates that we are far from that goal. What we find are many reports of empirical studies that are hard to generalize beyond the context of their published study. We find many published articles introducing new CF algorithms that follow the same template. The template begins by proposing an algorithm and then claiming experimental results showing its empirical superiority over one or two baseline algorithms. Unfortunately, we find it hard to evaluate the strength of such results due to variations in the experimental procedures, different datasets, and different algorithm implementations used to evaluate those algorithms.

New work and new methodology are required in CF algorithms research to bring us closer to the goal of understanding what algorithms are best for which domains. Rather than continue to propose new algorithms using methodologies that inhibit cross-comparison, we need research that seeks to synthesize the diversity of work that has been done before into a coherent picture that can be directly applied by practitioners seeking to implement or employ recommender systems using collaborative filtering.

Footnote 1: Since computing the predicted rating of every item is usually too computationally intensive, scientists have developed algorithms that approximate the process, generating a set of items that have predicted ratings above a threshold (e.g., predict items that are likely to be good but may not necessarily be the best recommendations).

1.4 Contributions of this Article

This article presents some initial results of our attempt to unify the knowledge regarding the accuracy of collaborative filtering recommendation algorithms. In particular, our contributions are:

1. A specification of challenges faced by scientists attempting to synthesize the existing published research on collaborative filtering algorithms. These challenges are presented in Section 2.

2. A case study representing an attempt to synthesize existing published work on CF algorithm accuracy. This study consists of an empirical comparison of a collection of proposed collaborative filtering algorithms that previously have only been examined individually against non-comparable baseline algorithms. This case study in Section 3 is used to illustrate the challenges that we introduce. The case study itself has several key contributions to further research on CF algorithms:
   a. Evidence that contradicts previous accuracy claims for certain algorithms.
   b. Evidence showing that nearest neighbor algorithms (in particular, the Item-Item algorithm) are most accurate at predicting rating values on multi-valued rating data of entertainment.
   c. Enhancements to several existing algorithms that significantly improved accuracy in our experiments.
      Most notable is an adaptation of the Bayesian network approach suggested by [4] that uses normalized user ratings rather than discrete rating classes.

3. Proposals for specific research methodologies and research infrastructure that would enable future research to better face the previously described challenges, enabling more generically usable scientific results. These proposals are discussed in Section 4 of this article.

4. An infrastructure that we have developed and made freely available to the public as a first step towards facing the challenges, including
   a. A portable and highly extensible software framework for investigating collaborative filtering algorithms, enabling rapid design and evaluation of new algorithms. This software framework represents the first step towards collaborative filtering research infrastructure that enables more effective future CF research.

   b. Source code for reference implementations of a collection of collaborative filtering algorithms that have been proposed by prominent CF researchers. Most of the algorithms have been tuned to ensure that they can perform at least as well as their original creators claimed.
   This infrastructure is introduced in Section 5.

5. Finally, a high-performance, production-capable recommendation engine, built on top of the CF software framework. Supporting well-known nearest-neighbor methods, this engine can generate hundreds of recommendation lists per second, given millions of user ratings. The source code for this engine is freely available. This software, introduced in Section 5.1, should greatly increase the availability of collaborative filtering technology to all software developers.

2 Challenges to Synthesis

Greater scientific advances are possible when individual research contributions can be synthesized into a broader understanding through objective comparison and third-party evaluation. In this section, we introduce characteristics of collaborative filtering algorithms research that present challenges to achieving such an understanding. The list of challenges that we have itemized in this section is not intended to be complete. Rather, these are the challenges that we have found frequently obstruct our attempts at synthesis. In the following section, Section 3, we illustrate examples of these challenges in an empirical case study.

2.1 Challenge 1: Different Datasets

Different datasets have different characteristics that can significantly affect the outcome of empirical analysis of collaborative filtering algorithms. For example, datasets may have different numbers of ratings per user and thus different amounts of training data, or they may have different granularity of ratings: one dataset may include ten levels of preference, while another may only include five. When two groups of researchers use entirely different datasets, it is difficult to synthesize their results into any form of greater understanding.
Proprietary datasets, those that have not been released to the public, present additional challenges because scientists not affiliated with the original researchers are unable to reproduce or extend results without access to the data.

Even when researchers use the same, publicly available dataset, their results may be dangerous to compare. CF researchers often want to run hundreds or thousands of different CF algorithm variants against the dataset. Yet with very large datasets, this can take an unacceptable amount of time, so they create smaller subsets of the whole dataset. Scientists have taken a variety of approaches (or none at all) to ensure that the subsets are representative of the whole. The additional variance creates additional uncertainty when trying to synthesize results achieved on different subsets of the same dataset. For example, some researchers sample only users with a minimum number of rated items to ensure that learning algorithms have enough information to work with for each user [5,9]. Others may randomly sample users without regard for number of ratings.
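To make the difference between the two sampling strategies concrete, the following is a minimal Python sketch, not taken from any of the cited studies. It assumes ratings are stored as (user, item, value) triples; the function names, the `min_ratings` parameter, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def group_by_user(ratings):
    """Group (user, item, value) triples into {user: [(item, value), ...]}."""
    by_user = defaultdict(list)
    for user, item, value in ratings:
        by_user[user].append((item, value))
    return by_user


def sample_users(ratings, n_users, min_ratings=0, seed=0):
    """Keep the ratings of up to n_users users, drawn at random from the
    users who have at least min_ratings ratings (0 means no filter)."""
    by_user = group_by_user(ratings)
    eligible = sorted(u for u, rs in by_user.items() if len(rs) >= min_ratings)
    rng = random.Random(seed)
    chosen = set(rng.sample(eligible, min(n_users, len(eligible))))
    return [(u, i, v) for (u, i, v) in ratings if u in chosen]


# Hypothetical toy dataset: user "b" has only one rating.
ratings = [
    ("a", 1, 4), ("a", 2, 5), ("a", 3, 3),
    ("b", 1, 2),
    ("c", 2, 4), ("c", 3, 1),
]
dense_subset = sample_users(ratings, n_users=2, min_ratings=2)   # never includes "b"
random_subset = sample_users(ratings, n_users=2, min_ratings=0)  # may include "b"
```

The two subsets come from the same dataset but can differ sharply in ratings per user, which is exactly the kind of variation that makes results on different subsets hard to compare.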

2.2 Challenge 2: Different Evaluation Metrics

There are many different metrics that can be applied (for a more complete discussion of CF metrics, see [10]); however, for the purposes of this article, we consider only collaborative filtering accuracy metrics. Examples of accuracy metrics include mean absolute error [9], precision and recall [23], and the rank half-life metric [4].

When two experiments use different empirical evaluation metrics, the results are very challenging to synthesize. For example, one experiment may report that algorithm A has a mean absolute error of 0.7, while the other experiment may report that algorithm B has a precision of 70%. How do these two algorithms compare? We cannot say based on this information from the two experiments; synthesis is not possible.

In particular, we can see this problem with ranked list evaluation of CF algorithms, where there is no emerging standard evaluation metric. Ranked list evaluation metrics attempt to measure the effectiveness of an algorithm at producing a useful list of top recommendations ranked by likely relevance or interest to the user. This contrasts with mean absolute error, which measures overall prediction error. As an example, different ranked list metrics have been used by Breese et al. [4], Karypis et al. [13], Sarwar et al. [21], and Schein [24]. In this study, we limited our experiments to measuring prediction error; we leave ranked list evaluation for future work.

2.3 Challenge 3: Different Experimental Protocols

An experimental protocol includes the procedures used to train an algorithm to learn preferences, and the exact procedure to apply the evaluation metric. At a high level, experimental protocols for analysis of offline data (data previously collected) all follow a common procedure. This procedure can roughly be described as withhold-and-predict. A ratings dataset is broken into two subsets, the learning set and the test set. The learning set is fed into the collaborative filtering algorithm as training data, which then predicts ratings or makes recommendations for items not in the training set.
The accuracy of those predictions or recommendations is evaluated based on the available ratings in the test set.

Aside from this basic organization, there are many ways that experimental protocol can vary that can inhibit synthesis. These variances include:

a. Treatment of recommendations for which test ratings are not available. If the test protocol involves having an algorithm generate a top-N best recommendations list for each test user, then the algorithm may recommend an item for which we have no rating in the test set. This issue can be handled in several ways. Most commonly the experimental protocol will simply evaluate the top N recommended items for which there are test ratings. However, another approach is to assume a default negative rating.

b. Treatment of missing or low confidence predictions. Some algorithms are unable to produce predictions or recommendations for items when insufficient ratings data for those items is available. Several methods have been applied. The most obvious method is to ignore failed predictions, as done by [3,9]. Alternately, the set of test ratings can be restricted to only those that all algorithms could predict. Or a less accurate, less personalized algorithm, such as the average rating for an item, can be used to predict in situations where the primary algorithm fails to generate a prediction.
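The withhold-and-predict procedure and two of the treatments of failed predictions described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the protocol used in this study: ratings are (user, item, value) triples, the random split and all function names are hypothetical, and a predictor signals "no prediction" by returning None.

```python
import random
from collections import defaultdict


def split_ratings(ratings, test_fraction=0.2, seed=0):
    """Withhold a random fraction of the ratings as the test set."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (learning set, test set)


def item_means(train):
    """Mean rating of each item in the learning set."""
    totals = defaultdict(lambda: [0.0, 0])
    for _, item, value in train:
        totals[item][0] += value
        totals[item][1] += 1
    return {item: s / n for item, (s, n) in totals.items()}


def mean_absolute_error(predict, test, fallback=None):
    """Return (MAE, coverage). predict(user, item) returns a float or None.
    Failed predictions are ignored unless a fallback predictor is given."""
    errors = []
    for user, item, actual in test:
        p = predict(user, item)
        if p is None and fallback is not None:
            p = fallback(user, item)
        if p is None:
            continue  # treatment 1: ignore failed predictions
        errors.append(abs(p - actual))
    mae = sum(errors) / len(errors) if errors else None
    coverage = len(errors) / len(test) if test else 0.0
    return mae, coverage


# Hypothetical hand-built split so the numbers are easy to follow.
train = [("u1", "a", 4), ("u2", "a", 2), ("u1", "b", 5)]
test = [("u3", "a", 4), ("u3", "c", 1)]  # item "c" never appears in train

means = item_means(train)
global_mean = sum(v for _, _, v in train) / len(train)

mae_ignore, cov_ignore = mean_absolute_error(lambda u, i: means.get(i), test)
mae_fallback, cov_fallback = mean_absolute_error(
    lambda u, i: means.get(i), test, fallback=lambda u, i: global_mean)
```

On this toy data the same predictor scores an MAE of 1.0 at 50% coverage when failures are ignored, and a noticeably higher MAE at full coverage when a fallback is used, which illustrates why results obtained under different protocols are not directly comparable.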

2.4 Challenge 4: Variance in Algorithm Implementation

The final of the four challenges is that when different scientists implement what they believe to be the same algorithm, the implementations commonly provide different results. This variance can occur for many reasons, including:

a. Different interpretation of algorithm details. In certain research publications, such as conference proceedings, the need for brevity almost guarantees that there will be insufficient space to describe all the details required to completely specify how to implement a particular algorithm. Thus, scientists trying to re-implement a previously published algorithm often will apply their own interpretation of how details should be handled.

b. Algorithm tuning. Some algorithms have many parameters; adjusting the parameters can cause the algorithm to respond differently to particular inputs. Each scientist may tune the algorithm to meet a different need using a different set of values for controlling parameters. Furthermore, scientists rarely publish in detail how algorithm parameters were tuned.

c. Errors in the code implementing the algorithm. It is very hard to detect errors in implementations of collaborative filtering algorithms unless the errors cause the algorithm to generate results that are highly improbable.

d. Application of algorithm enhancements. Clean and simple abstract representations of algorithms communicate the best and are more readily accepted under peer review. Yet real world success often requires that these clean and simple algorithms be embellished, often with heuristics that cannot be justified theoretically. These enhancements are often not discussed in published work, yet certain enhancements are required to produce the optimal accuracy.

Variance in algorithm implementation is a considerable problem due to the dynamics of peer review. Scientific peer review culture rewards researchers who present new algorithms that outperform existing algorithms by some criteria.
In order to gain acceptance for their new algorithm, researchers must implement one or more of the previously existing algorithms, and then show that their new algorithm outperforms them. These researchers are highly motivated to ensure that their new algorithm has no errors, and that it has the best enhancements applied. However, they have less incentive to ensure that the implementations of the competing algorithms are optimal.

3 Case Study: An Empirical Comparison of Popular Algorithms

In spite of the aforementioned challenges (and at first, in some ignorance of them), we set out to synthesize 10 years of recorded knowledge about collaborative filtering (CF) algorithms through empirical experimentation. This consisted of comparing the accuracy of many proposed CF algorithms in a common controlled experimental setup. The primary research questions of this activity were as follows:

- Could we replicate the good performance claimed by the authors of each algorithm?
- Could we replicate the claims of relative performance made by authors of each algorithm?

- Could we establish a global ranking of algorithm quality, with respect to mean absolute error?

We evaluated a set of algorithms on two subsets of the EachMovie dataset (one with increased sparsity) as well as the Jester joke dataset. Our metric of evaluation was mean absolute error. In this section, we describe our experiment, reporting both our empirical results and the examples of the challenges we observed.

3.1 Descriptions of Algorithms Evaluated

We chose to evaluate primarily algorithms that are frequently cited in the literature, specifically algorithms that work on explicit ratings data, as well as a few additional algorithms we constructed. In some cases, we made minor modifications to existing algorithms in order to improve performance. For the purposes of repeatability, we describe all the tested algorithms in some detail here. A summary is provided in Table 1.

Code   Name                          Implementation Reference             Modified
MIR    Mean Item Rating              [9] (as "average")                   No
AMUR   Adjusted Mean User Rating     [9] (as "bias-from-mean average")    No
AMIR   Adjusted Mean Item Rating     New                                  -
CORR   Pearson r Correlation         [9]                                  No
VSIM   Vector Similarity             [4]                                  Yes
HORT   Horting                       [3]                                  Yes
ITEM   Item-Item                     [23]                                 No
BC     Bayesian Clustering           [4]                                  No
BN     Bayesian Network              [4]                                  No
CBN    Continuous Bayesian Network   New                                  -
PD     Personality Diagnosis         [18]                                 No

Table 1: Summary of all algorithms included in study

Notation

In order to more precisely describe the operation of some of the algorithms, we define here a certain amount of mathematical notation for use in this section. In this paper, $U$ represents the set of all users and $I$ represents the set of all items. Let $U_i$ be the subset of $U$ consisting of all users who have rated item $i$. Analogously, let $I_u$ be the subset of $I$ consisting of the items rated by user $u$. Let $r_{u,i}$ be the rating of user $u$ on item $i$, if known. In this document, the variables $i$ and $j$ refer to items and $u$ and $v$ refer to users. Let $\bar{r}_u$ be user $u$'s mean rating, and let $\bar{r}_i$ be the mean rating for item $i$. The algorithms we investigated all compute a predicted rating for a user $u$ on an item $i$, which we refer to as $p_{u,i}$. In this context, $u$ and $i$ may be termed the active user and active item, respectively.

Non-personalized Algorithms

One of the simplest recommendation techniques is to recommend those items that are most popular. We refer to such algorithms as non-personalized, since their ratings reflect the preferences of the entire user set more than those of the active user. The recommendations of these algorithms are analogous to the New York Times bestsellers list or weekly box office statistics. Since these algorithms are so simple and straightforward, we expect that more complicated algorithms should at least match their performance in predicting individual ratings.

Mean Item Rating (MIR)

The simplest algorithm we implemented uses the mean rating of the active item for its prediction, independent of which user is the active user:

$$p_{u,i} = \bar{r}_i$$

This algorithm has been used as a baseline by Breese et al. [1998], Herlocker et al. [1999], Goldberg et al. [2000], and others, sometimes under the name POP, short for popularity. We refer to this algorithm as Mean Item Rating to distinguish it from other non-personalized algorithms.

Adjusted Mean User Rating (AMUR)

In examining different CF rating data sets, we have found that each user has a different distribution of ratings across possible rating values. For example, 80% of one user's ratings may have the value 4, while 75% of another user's ratings may have the value 3. One possible explanation for these variations in rating distribution is that the two users described may have had different perceptions of the rating scale. For example, one user's rating of 4 may indicate the same underlying preference as another user's rating of 3. We can account for this by using the offsets from each user's mean rating rather than their raw ratings. Herlocker et al. [1999] found that averaging these offsets and adding the active user's mean rating produced more accurate predictions than Mean Item Rating:

$$p_{u,i} = \bar{r}_u + \frac{\sum_{v \in U_i} (r_{v,i} - \bar{r}_v)}{|U_i|}$$

We refer to this as a normalization of ratings even though it is not a true normalization in the statistical sense (see footnote 2). Algorithms that use this kind of normalization assume that each user's mean rating represents a neutral preference, and that set amounts above or below that mean represent the same preference for all users. One can think of examples where this is not the case: for example, if some users only rate items they like, then a user's mean rating could be a poor indication of neutral preference.

Footnote 2: That would require also dividing by the standard deviation of each user's rating distribution, so that every user's ratings have unit variance. However, adjusting for differences in the width of a distribution (the std. dev.) has not been shown to significantly improve prediction accuracy [Herlocker et al. 1999].

Adjusted Mean Item Rating (AMIR)

An alternate normalization technique is to use mean item ratings rather than mean user ratings. By taking the Adjusted Mean User Rating algorithm and swapping users for items, we obtain a new algorithm with predictions generated by the following formula:

$$p_{u,i} = \bar{r}_i + \frac{\sum_{j \in I_u} (r_{u,j} - \bar{r}_j)}{|I_u|}$$

This algorithm assumes that each user rates all items some constant amount above or below those items' mean ratings. For example, in this model, one user might rate every item 1 higher than its average, while another might rate every item 1 below its average. Thus, all users are still assumed to have the same overall preferences, though their individual rating scales may differ.

Nearest Neighbor Algorithms

The first algorithms used in collaborative filtering systems were nearest neighbor algorithms. With one exception, algorithms of this class generate predictions by first computing the similarity of the active user to each potential neighbor and then computing a weighted average of the most similar neighbors' ratings for the active item. The underlying theory is that users who have rated items similarly in the past are likely to do so in the future. These algorithms are all analogous to asking like-minded friends for item recommendations. The one exception to this is the Item-Item algorithm, which forms a neighborhood of items rather than users, but is otherwise quite similar to the user-based algorithms [23].

Some researchers have referred to this class of algorithms as memory-based, because many of them require that all ratings be kept in memory in order to compute predictions [4,12,18].
However, as we show in Section 6, this is not always the case: variants of the Pearson r Correlation algorithm using sampling can be shown to be almost as effective as the original while using only a fraction of the memory.

Pearson r Correlation (CORR)

The Pearson r Correlation algorithm was used in some of the earliest collaborative filtering systems [19,25], yet it remains a popular baseline algorithm today, since it is easy to implement and fairly effective. In this algorithm, Pearson's r correlation coefficient is used to define the similarity of two users based on their ratings for common items:

$$\mathrm{sim}(u,v) = \frac{\sum_{i \in I_u \cap I_v} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sigma_u \sigma_v}$$

$\sigma_u$ and $\sigma_v$ represent the standard deviations of the ratings of users $u$ and $v$, respectively. Both the rating averages ($\bar{r}_u$, $\bar{r}_v$) and standard deviations are taken over just the common items rated by both users. In order to achieve the best possible implementation, we have used the modification suggested by Herlocker et al. [1999], which scales down similarities when the number of item ratings in common between $u$ and $v$ is less than some threshold parameter $\gamma$:

$$\mathrm{sim}'(u,v) = \frac{\min(|I_u \cap I_v|, \gamma)}{\gamma} \, \mathrm{sim}(u,v)$$

The scaling factor is below 1 only when fewer than $\gamma$ items are in common. This adjustment avoids overestimating the similarity of users who happen to have rated a few items identically, but may not have similar overall preferences. Such correlations may be high, but due to the limited amount of data, we have little confidence in them.

The adjusted similarity weights are used to select a neighborhood $V \subseteq U_i$, consisting of the $k$ users most similar to $u$ who have rated item $i$. If fewer than $k$ users have positive similarity to $u$, then only those users with positive similarity are used. The ratings of these neighbors are combined into a prediction as follows:

$$p_{u,i} = \bar{r}_u + \frac{\sum_{v \in V} \mathrm{sim}'(u,v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in V} \mathrm{sim}'(u,v)}$$

Vector Similarity (VSIM)

The Vector Similarity algorithm considers each user's set of ratings as a vector and uses the cosine of the angle between two users' ratings vectors as a measure of their similarity [4]. More precisely,

$$\mathrm{sim}(u,v) = \frac{\sum_{i \in I_u \cap I_v} r_{u,i}\, r_{v,i}}{\sqrt{\sum_{i \in I_u} r_{u,i}^2}\;\sqrt{\sum_{i \in I_v} r_{v,i}^2}}$$

As in Pearson r Correlation, a neighborhood $V$ is formed consisting of the $k$ users most similar to the active user that have rated the active item. (Breese et al. [1998] did not limit the number of neighbors, but we found this step to be very helpful.) A prediction is then computed as follows:

$$p_{u,i} = \frac{\sum_{v \in V} \mathrm{sim}(u,v)\, r_{v,i}}{\sum_{v \in V} \mathrm{sim}(u,v)}$$

Breese et al. [1998] also proposed adjusting the similarity weight computation so that agreement about infrequently rated items would contribute more to two users' similarity than agreement about frequently rated items. However, we did not find this modification to be helpful in our experiments, so we did not include it.
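As an illustration of the Pearson r Correlation steps described above (similarity over co-rated items, down-weighting for small overlap, neighborhood selection, and the offset-from-mean prediction), here is a minimal Python sketch. It is not the CoFE implementation. It assumes ratings are held in a nested dictionary `{user: {item: rating}}`, uses the standard normalization of Pearson's r (dividing by the square roots of the summed squared deviations over the co-rated items), and takes each user's mean over all of that user's ratings in the prediction step. The function names and the toy values for `k` and `gamma` are hypothetical.

```python
from math import sqrt


def pearson(ru, rv):
    """Pearson r over the items rated by both users.
    ru and rv map item -> rating. Returns (r, number of common items)."""
    common = sorted(set(ru) & set(rv))
    n = len(common)
    if n < 2:
        return 0.0, n
    mean_u = sum(ru[i] for i in common) / n
    mean_v = sum(rv[i] for i in common) / n
    cov = sum((ru[i] - mean_u) * (rv[i] - mean_v) for i in common)
    ss_u = sum((ru[i] - mean_u) ** 2 for i in common)
    ss_v = sum((rv[i] - mean_v) ** 2 for i in common)
    if ss_u == 0 or ss_v == 0:
        return 0.0, n  # correlation is undefined for constant ratings
    return cov / sqrt(ss_u * ss_v), n


def weighted_similarity(ru, rv, gamma):
    """Scale r by min(n_common, gamma) / gamma to devalue small overlaps."""
    r, n = pearson(ru, rv)
    return r * min(n, gamma) / gamma


def predict(ratings, u, item, k, gamma):
    """Predict u's rating of item from the k most similar raters of item.
    Returns None when no neighbor has positive similarity."""
    ru = ratings[u]
    mean_u = sum(ru.values()) / len(ru)
    candidates = []
    for v, rv in ratings.items():
        if v == u or item not in rv:
            continue
        s = weighted_similarity(ru, rv, gamma)
        if s > 0:
            candidates.append((s, v))
    neighbors = sorted(candidates, reverse=True)[:k]
    if not neighbors:
        return None
    num = 0.0
    for s, v in neighbors:
        mean_v = sum(ratings[v].values()) / len(ratings[v])
        num += s * (ratings[v][item] - mean_v)
    return mean_u + num / sum(s for s, _ in neighbors)


# Hypothetical toy data: "v" agrees with "u", "w" disagrees.
ratings = {
    "u": {"a": 5, "b": 3, "c": 4},
    "v": {"a": 4, "b": 2, "c": 3, "d": 5},
    "w": {"a": 1, "b": 5, "c": 2, "d": 1},
}
prediction = predict(ratings, "u", "d", k=2, gamma=6)
```

In this toy example only "v" has positive similarity to "u", so the prediction is u's mean rating (4) plus v's offset from its own mean on item "d" (1.5). Note that the sketch does not clamp predictions to the rating scale.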

Horting (HORT)

One weakness Pearson r Correlation and Vector Similarity share is that in order for two users to be considered similar, they must have rated items in common. If only a few users have rated a given item, none of whom has much in common with the active user, then Pearson r Correlation and Vector Similarity might both be unable to produce a prediction. In theory, two users who rate items similarly to a third are likely to have similar taste, even though they may have rated no items in common. The Horting algorithm recognizes that indirect similarity by allowing neighbors to be acquired transitively [3].

In the Horting algorithm, each user is represented by a node in a directed graph, where a link from user $u$ to user $v$ means that user $v$ predicts user $u$. Also stored with each link are two integers, $s \in \{-1, +1\}$ and $t \in \mathbb{Z}$. These variables specify a linear transformation

$$L_{s,t}(r) = s\,r + t$$

to normalize the target user's ratings with respect to the originating user's ratings. This allows users who rate items consistently higher, lower, or opposite of each other to predict each other. (Note that on a 0 to $n$ rating scale, all useful values of $t$ will actually lie between $-2n$ and $+2n$, since those offsets are sufficient to convert a minimal rating to a maximal rating, or vice versa, even when $s = -1$.) In practice, we did not find negative transformations ($s = -1$) to be helpful, so we did not use them.

Adjacencies in this directed graph are determined by two threshold requirements. The first establishes that the target user has rated a representative sample of the items rated by the originating user. The original authors called this requirement horting, a new word derived from cohorts, specific to this algorithm. User $u$ is said to hort another user $v$ if either $v$ has rated some fraction $\alpha$ of the items rated by $u$, or if $v$ has rated at least $\beta$ of the items rated by $u$. $\alpha$ and $\beta$ are both algorithm parameters. Mathematically, user $u$ horts user $v$ if

$$\frac{|I_u \cap I_v|}{|I_u|} \ge \alpha \quad \text{or} \quad |I_u \cap I_v| \ge \beta.$$
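The horting test and the linear transformation can be written down directly. The following is a small Python sketch under the assumption that each user's rated items are available as a set; the function names and the example numbers are hypothetical and chosen only to show the two thresholds at work.

```python
def transform(r, s, t):
    """Linear transformation L_{s,t}(r) = s*r + t applied to a rating."""
    return s * r + t


def horts(items_u, items_v, alpha, beta):
    """True if user u horts user v: v has rated at least a fraction alpha
    of u's items, or at least beta of u's items."""
    if not items_u:
        return False
    common = len(items_u & items_v)
    return common / len(items_u) >= alpha or common >= beta


# Hypothetical example: u rated 8 items; v rated those 8 plus 40 others.
items_u = set(range(8))
items_v = set(range(48))

u_horts_v = horts(items_u, items_v, alpha=0.25, beta=10)  # 8/8 >= 0.25
v_horts_u = horts(items_v, items_u, alpha=0.25, beta=10)  # 8/48 < 0.25 and 8 < 10
```

With these values `u_horts_v` is true while `v_horts_u` is false, because the overlap is measured against the item set of the user on the left-hand side of the relation.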
Note that this is not symmetric: if user u has rated 10 items, user v has rated those same 10 items plus 100 more, α = 0.2 and β = 20, then u horts v but v does not hort u, since u has not rated a sufficient sample of the items rated by v.

The second threshold establishes that the target and originating user tend to rate items similarly, after taking into account different rating scales via the linear transformation $L_{s,t}(r)$. The prediction error e between two users is the average absolute difference of their common ratings. More precisely,

$$e(u, v, s, t) = \frac{\sum_{i \in I_u \cap I_v} \left| r_{u,i} - L_{s,t}(r_{v,i}) \right|}{|I_u \cap I_v|}$$

If there exist s and t such that $e(u, v, s, t) < \delta$ for some prediction error parameter δ and user u horts user v, then user v is said to predict user u. This means that there is a link from user u's node to user v's node, with the variables s and t set to minimize $e(u, v, s, t)$. These optimal values for s and t can be found by calculating the prediction error e for each possible value of s and t.

The predicted rating for user u on item i is computed by searching through the graph at each distance level l = 1, …, k and determining if there is at least one user in the graph within distance l of user u that has rated item i. The predicted rating, $p_{u,i}$, is the average transformed rating given by all users at distance l who have rated item i, for the minimum such distance l. For users more than one step away, transforms are composed. If no user at distance less than or equal to k from the active user has rated item i, then no prediction can be computed. This method should tend to use ratings from better predictors if possible, but will use worse ones as necessary.

On some datasets, we found that accuracy could be improved by adding two additional parameters: m, the minimum number of neighbors required, and M, the maximum number of neighbors allowed. Here, neighbors are those users whose ratings are aggregated to compute a prediction for a given item. While traversing the graph to make a prediction, if fewer than m neighbors have been found at a distance level of l or less, the algorithm will continue searching at the next level. Note that in this case, the neighbors whose transformed ratings are averaged to produce a prediction could come from two or more different levels. Once M neighbors have been found, the algorithm will average the transformed ratings of those M neighbors and terminate. Neighbors with lower prediction errors were considered first, to ensure that the best M predictors were used. If m is greater than 1, then these M predictors could be distributed over two or more distance levels. Note that the modified algorithm is equivalent to the original when m = 1 and M = ∞.

Item-Item (ITEM)

Each of the nearest-neighbor algorithms discussed so far finds users who have rated the active item and are similar to the active user. An alternate approach is to find items rated by the active user that are similar to the active item. Sarwar et al. [23] proposed several different algorithms that used similarities between items, rather than users, to compute predictions. These algorithms all assume that the active user's ratings for items related to the active item are a good indication of the active user's preference for the active item. Of the algorithms proposed by Sarwar et al., we only implemented adjusted cosine similarity, the algorithm Sarwar et al.
[23] found to be most accurate; here we refer to it as Item-Item, since it is the only algorithm we tested that computes similarities between items. In this algorithm, the cosine of the angle between the item rating vectors is computed, after adjusting each rating by subtracting the rating user's mean rating. Specifically, writing $U_i$ for the set of users who have rated item i,

$$\mathrm{sim}(i, j) = \frac{\sum_{u \in U_i \cap U_j} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{v \in U_i} (r_{v,i} - \bar{r}_v)^2}\,\sqrt{\sum_{w \in U_j} (r_{w,j} - \bar{r}_w)^2}}$$

Note that unlike Pearson r Correlation, means are taken over all ratings for a user or item, not a subset of ratings shared with any other user or item. We found it helpful to adjust similarity weights based on the number of users in common, if the number of common users was below a certain threshold γ:

$$\mathrm{sim}'(i, j) = \mathrm{sim}(i, j) \cdot \frac{\min(\gamma, |U_i \cap U_j|)}{\gamma}$$
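As a rough sketch of the adjusted cosine computation described above, the following function works on a small in-memory ratings dictionary. The data layout (`ratings[user][item]`), the function name, and the `gamma` keyword are assumptions for illustration. The damping step scales down similarities supported by fewer than `gamma` common users, following the prose description of the threshold.

```python
import math


def adjusted_cosine(ratings, i, j, gamma=None):
    """Adjusted cosine similarity between items i and j.

    ratings: dict mapping user -> dict mapping item -> rating (assumed layout).
    gamma: optional threshold on the number of common users; similarities
           backed by fewer common users are scaled down proportionally.
    """
    # Each user's mean is taken over all of that user's ratings.
    means = {u: sum(r.values()) / len(r) for u, r in ratings.items() if r}
    users_i = [u for u, r in ratings.items() if i in r]
    users_j = [u for u, r in ratings.items() if j in r]
    common = [u for u in users_i if j in ratings[u]]

    num = sum((ratings[u][i] - means[u]) * (ratings[u][j] - means[u])
              for u in common)
    den_i = math.sqrt(sum((ratings[v][i] - means[v]) ** 2 for v in users_i))
    den_j = math.sqrt(sum((ratings[w][j] - means[w]) ** 2 for w in users_j))
    if den_i == 0 or den_j == 0:
        return 0.0  # no variation around user means; similarity undefined

    sim = num / (den_i * den_j)
    if gamma:
        sim *= min(gamma, len(common)) / gamma
    return sim


# Toy data: both users rate x and y identically relative to their own means.
toy = {
    "a": {"x": 5, "y": 5, "z": 2},
    "b": {"x": 1, "y": 1, "z": 4},
}
print(adjusted_cosine(toy, "x", "y"))           # close to 1.0
print(adjusted_cosine(toy, "x", "y", gamma=4))  # halved: only 2 common users
```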

The predicted rating for a given user u and item i is computed using a neighborhood of items $J \subseteq I_u$ consisting of the k items rated by u that are most similar to i. If there are fewer than k items with positive similarity to i, then just those are used.

$$p_{u,i} = \bar{r}_i + \frac{\sum_{j \in J} \mathrm{sim}'(i, j)\,(r_{u,j} - \bar{r}_j)}{\sum_{j \in J} \mathrm{sim}'(i, j)}$$

Probabilistic Algorithms

An alternate approach to the nearest neighbor methods is to learn a probabilistic model of the data, and use this model to predict ratings. Probabilistic algorithms tend to have more direct mathematical justification than nearest neighbor methods, given their assumptions of user behavior. Probabilistic algorithms have also been referred to as model-based algorithms [4,12,18], but we prefer the term probabilistic algorithms. Nearest neighbor methods, such as the Horting algorithm, may build models as well, if only to represent neighbors.

Bayesian Clustering (BC)

Breese et al. [1998] proposed a simple probabilistic model for collaborative filtering, based on the assumption that there are distinct groups of users, each with fairly homogeneous taste throughout. For example, types of users who watch movies might include those who love action movies, those who love romantic comedies, those who love art films, and so on. Using machine learning methods, these different user groups can be learned automatically from the data. Then, in order to predict the rating for a particular user on a particular movie, we could simply average each user group's mean rating for that movie, weighted by the probability that this particular user is a member of that group.

We implemented the proposed Bayesian clustering algorithm as a naïve Bayes classifier, where each item rating is conditionally independent given user class, a hidden variable representing the user's preference type.³ In this model, we store each probability that a user of a given class will assign a given rating to a given item. With the application of Bayes' rule, these probabilities are also sufficient to determine the probability that a user is a member of a given class.
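The way a naïve Bayes model of this kind yields both class memberships and predictions can be sketched as follows. This is only an illustration under stated assumptions: the table layout (`prior[c]` and `cond[c][item][rating]`), the function names, and the toy probabilities are hypothetical, and the parameters are taken as already learned rather than trained here.

```python
def class_posterior(prior, cond, user_ratings):
    """P(class | user's ratings) via Bayes' rule under naive independence.

    prior: list where prior[c] = P(class c).
    cond: cond[c][item][rating] = P(rating | class c, item).
    user_ratings: dict item -> rating already given by the user.
    """
    scores = []
    for c, p_c in enumerate(prior):
        score = p_c
        for item, rating in user_ratings.items():
            score *= cond[c][item][rating]
        scores.append(score)
    total = sum(scores)
    if total == 0:
        return list(prior)  # ratings impossible under every class: fall back
    return [s / total for s in scores]


def predict(prior, cond, user_ratings, item):
    """Class-weighted average of each class's expected rating for `item`."""
    post = class_posterior(prior, cond, user_ratings)
    expected = [sum(r * p for r, p in cond[c][item].items())
                for c in range(len(prior))]
    return sum(w * e for w, e in zip(post, expected))


# Two hypothetical classes with opposite tastes on items "a" and "b".
prior = [0.5, 0.5]
cond = [
    {"a": {1: 0.1, 5: 0.9}, "b": {1: 0.2, 5: 0.8}},
    {"a": {1: 0.9, 5: 0.1}, "b": {1: 0.8, 5: 0.2}},
]
print(class_posterior(prior, cond, {"a": 5}))
print(predict(prior, cond, {"a": 5}, "b"))
```

For long rating histories, a practical version would likely accumulate log-probabilities instead of multiplying raw probabilities, to avoid numerical underflow.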
Note that this is only one of several probabilistic clustering models that have been proposed for collaborative filtering; for other models, see [27] and [11,12].

These probabilities are learned from the training data using a gradient ascent approach with a fixed number of iterations. First, the model is randomly initialized. Then, in each iteration, each user is assigned to the most probable class based on previously rated items. Since the membership of each user class may have changed, user class probability distributions must be recomputed. Of course, once the user class probability distributions have changed, some users may no longer be in their most probable class, so the process repeats. The predicted rating for an item is the average of the expected ratings for each preference class, weighted by the probability that the active user is a member of that class.

³ See [Mitchell 1997] for a more thorough explanation of naïve Bayes classifiers and training through gradient ascent.

Bayesian Network (BN)

Breese et al. [1998] proposed using Bayesian networks for collaborative filtering. Each node in this model is a categorical variable representing an item, whose states cover every legal rating and No Rating. The inclusion of a No Rating state allows the model to be learned with complete data even if no user has rated every item. The probability distribution for each item is modeled by a decision tree. We used Microsoft Research's WinMine Toolkit to build all of our Bayesian networks [6].

For generating a prediction, all nodes in the network are instantiated with the ratings or lack thereof for the active user. The probability of the No Rating state is clamped to zero for the item in question and the probability distribution over all legal ratings is generated using Markov-blanket inference [4]. In Markov-blanket inference, the probability that a given variable has a given state is dependent on its parents (variables in the given variable's decision tree), its children (variables in whose decision trees the given variable appears), and its children's parents. The predicted rating is the resulting expected value.

Unlike all previously discussed algorithms, the Bayesian Network algorithm directly assumes that a missing rating (marked by the No Rating state) is an indication of preference (negative preference in this case). This is an interesting approach, with some logic behind it: the fact that you haven't watched a certain movie, for example, could indicate that you wouldn't be interested in watching other, similar movies. This algorithm makes some additional assumptions regarding how user preferences may be effectively modeled: it assumes that each different rating for an item represents a distinct preference class (e.g., the algorithm doesn't know that 4 is closer to 5 than 1 is), that a user's rating for any given item depends only on that user's ratings of a few specific items in the dataset, and that this dependence can be effectively represented using decision trees.

Continuous Bayesian Network (CBN)

One of the weaknesses of the Bayesian Network algorithm used by Breese et al.
[1998] is that it treats the training data as classification examples rather than numerical ratings. The decision trees it builds depend on having many training examples with identical ratings of several items in order to build the probability distribution at each leaf. It is difficult, however, to find many users who have given three or four items identical rating values, and thus most of the splits in the decision trees are on the No Rating state (about 97% in models built from our EachMovie dataset, described in the Datasets section). In other words, users' predictions are based largely on what they choose to rate, ignoring most of the actual ratings given. While the resulting model may be interesting to analyze, since it shows many simple relationships between items, it fails to take advantage of much of the information in the original data.

An alternative approach is to represent each item rating as a numerical, not categorical, variable in the network. To do this, we represented each item's rating as a binary Gaussian variable, either having the value No Rating, or a real number representing an offset from a user's mean rating. The model is trained not on the raw ratings themselves (i.e., $r_{u,i}$), but on each rating's offset from its user's mean rating (i.e., $r_{u,i} - \bar{r}_u$). This algorithm has two advantages: first, it works with normalized ratings rather than raw ratings, to take into account differences between users' rating distributions; second, it treats ratings as interrelated numbers, rather than distinct classes. We call this modified algorithm Continuous Bayesian Network, since it closely resembles the Bayesian Network algorithm in assumptions and implementation, but represents user ratings as continuous variables. Using this revised algorithm on the EachMovie dataset, we found that fewer than 55% of the decision tree splits were on No Rating.

Personality Diagnosis (PD)

The Personality Diagnosis algorithm works on the assumption that the active user has the same true preferences as some other user, though the observed ratings may differ by Gaussian noise [18]. Unique to this algorithm is the idea that users have true ratings for each movie, with differing observed ratings due to temporary moods and impulses. This algorithm is also unique in that it uses both a probabilistic approach and a nearest-neighbor framework: though it never computes a neighborhood directly, it does compute similarities and perform a weighted average over all ratings for the item. We include it among other probabilistic algorithms because of the methods it uses for computing the similarity.

The similarity between the active user u and some neighbor v is the probability that u's true ratings are identical to v's observed ratings. This is fairly straightforward to compute given the assumption that observed ratings differ from true ratings according to Gaussian noise with some variance σ². To predict a rating for the active user on the active item, the probability of each valid rating value is computed by summing the probabilities of all users who have given that rating value to the active item. The predicted rating value is the one with the highest probability, not the expected value, as in other probabilistic approaches.

Other Algorithms

There are algorithms we did not fully implement and investigate. Some were omitted due to time constraints, others because they claimed no improvement in predictive accuracy on explicit ratings data. These algorithms include RecTree [5], Dependency Networks [8], Eigentaste [7], Singular Value Decomposition [22], and Probabilistic Latent Semantic Analysis [12].
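To make the Personality Diagnosis prediction step described earlier more concrete, here is a simplified sketch. It is an assumption-laden illustration rather than the published algorithm: the function name, the `ratings[user][item]` layout, and in particular the choice to compare users only on items they have both rated are simplifications introduced here.

```python
import math


def pd_predict(ratings, active, item, scale, sigma=1.0):
    """Most probable rating for `active` on `item`, Personality Diagnosis style.

    ratings: dict user -> dict item -> rating (assumed layout).
    scale: iterable of the valid discrete rating values.
    sigma: standard deviation of the assumed Gaussian rating noise.
    Simplification: users are compared only on co-rated items.
    """
    probs = {x: 0.0 for x in scale}
    mine = ratings[active]
    for v, r_v in ratings.items():
        if v == active or item not in r_v:
            continue
        common = [j for j in mine if j in r_v and j != item]
        sq_diff = sum((mine[j] - r_v[j]) ** 2 for j in common)
        # Unnormalized likelihood that v's observed ratings are u's true ratings.
        weight = math.exp(-sq_diff / (2 * sigma ** 2))
        probs[r_v[item]] = probs.get(r_v[item], 0.0) + weight
    if all(p == 0.0 for p in probs.values()):
        return None  # nobody else rated the item
    return max(probs, key=probs.get)  # mode, not the expected value


toy = {
    "u": {"a": 5, "b": 4},
    "v": {"a": 5, "b": 4, "c": 5},
    "w": {"a": 1, "b": 1, "c": 2},
}
print(pd_predict(toy, "u", "c", scale=range(1, 6)))
```

Returning the mode rather than the mean mirrors the text; swapping the last line for a probability-weighted average would give the expected-value variant.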
3.2 Experimental Methods Used in the Empirical Study

Datasets

We performed our experiments on three different datasets in order to cover some of the variation present in different collaborative filtering systems. The datasets we selected were those that were most available and most commonly used by other researchers. They are summarized in Table 2, and described in further detail in the subsections that follow.

Name  Users  Items  Ratings  Ratings/User  Density  Rating Scale
EachMovie  6,185  1,  ,  to 5, integer
Sparse EachMovie  12,144  1,  ,  to 5, integer
Jester  17,  ,  to 10.00, real

Table 2: Summary of datasets included in our investigation.

EachMovie

One of the largest datasets of explicit user preferences is EachMovie, a movie rating database collected over a period of 18 months by the Compaq Corporation [16]. EachMovie contains the ratings of approximately 60,000 users for a set of 1,800 movies, 2.8 million ratings in all, or an average of 46 ratings per user. The rating scale ranges from 0 to 5. Each rating also has a weight associated with it, which indicates whether the user saw the movie or merely clicked a button reading "That looks awful." In our analysis, we only used ratings for movies the user actually saw, which reduced the average number of ratings per user. Finally, we used a subset consisting of all ratings for a random 10% of the users. We used a representative sample of the entire dataset to enable us to analyze and cross-validate more algorithm variants within the timeline of our project and with the computational power available.

Sparse EachMovie

To better study the effects of sparsity on each algorithm, we artificially increased the sparsity of a subset of EachMovie data by selecting a random 50% of the ratings of a random 20% of the users in the complete EachMovie dataset. The resulting dataset has approximately the same number of ratings as the first, but spread out over twice as many users, resulting in half the density. As before, only explicit ratings were included.

Jester

The Jester dataset consists of ratings on 100 jokes by almost 18,000 users, over 900,000 ratings in all [7]. Ratings are recorded on a bounded real-valued scale (see Table 2) in increments of 0.01, yielding 200 distinct possible ratings in all. Each joke was rated immediately after being read by the user, using an image map with one extreme representing strong liking and the other strong dislike. Before receiving any recommendations, all users were required to first rate a gauge set of 10 jokes. For a collaborative filtering dataset, this dataset is exceptionally dense. Since it takes fairly little time to read and rate a joke, and since only 100 jokes are available, a user could rate every joke in less than an hour.
The average user actually rated about 50 jokes, but 50% density is much higher than the 2.6% density for EachMovie. While importing the ratings into our database, we discovered that over 900 of the ratings were outside the allowed range. On the advice of Goldberg [2003], these ratings were removed and not considered in our experiments.

The almost-continuous nature of these ratings could present difficulties for some algorithms. Bayesian Clustering and Bayesian Network both build models that compute the probabilities of a sub-population of users assigning each discrete rating to a given item. For a dataset with 200 discrete ratings this is impractical: each probability would be very low, and many would be zero. The best way to use these algorithms would be to group the ratings into a smaller number of rating ranges, and then use the algorithms on the ranges rather than the raw ratings. Due to the time required to adapt these algorithms and select the optimal rating ranges, as well as their poor performance on the movie datasets, we chose not to test these algorithms on the Jester dataset.

The Horting and Personality Diagnosis algorithms were also designed to work on discrete data, but were included in the experiment anyway. Specifically, the Horting algorithm uses an integer offset t for normalizing one user's ratings with respect to another's; if we were to let t be fractional instead, perhaps the increased fineness would improve this algorithm's performance on the dataset. The Personality Diagnosis algorithm computes the probability that a user's rating is each integer, and recommends the integer rating with the highest probability. This method thus introduces artificial coarseness when used on almost-continuous data, and might perform better if it returned the expected value instead.

Metrics

Here we describe briefly our choice of evaluation metric. A complete discussion of appropriate metrics for collaborative filtering is beyond the scope of this article. We refer readers to [10] for a substantial discussion of the evaluation of collaborative filtering systems.

In this experiment, we applied two variants of the most popular accuracy metric, mean absolute error (MAE). MAE measures how close predicted ratings are to the true ratings. For a set of ratings in the test set, T, MAE is defined as follows:

$$\mathrm{MAE} = \frac{1}{|T|} \sum_{r_{u,i} \in T} \left| p_{u,i} - r_{u,i} \right|$$

There are usually some ratings in T for which a given algorithm is unable to furnish a prediction. For example, when using the Pearson r Correlation predictive algorithm, if none of the users who rated the active item had rated any items in common with the active user, no prediction can be computed. In this situation, most researchers choose to simply remove that prediction from consideration, so that it doesn't affect the MAE of that algorithm; however, in the extreme, this can lead to algorithms that avoid making errors by never predicting unless the evidence is overwhelming. We have assumed in this study that high coverage (the percentage of ratings for which an algorithm can supply a prediction) is important. Thus we require that every algorithm provide a prediction for every item. To achieve this goal, whenever an algorithm cannot produce a prediction (usually because the computation does not consider all the data in the dataset), we instead supply the population average, that is, the Adjusted Mean User Rating.
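The difference between omitting failed predictions and filling them in with a fallback can be sketched as below. The function names, the use of `None` to mark a failed prediction, and the `fallback` callable (standing in for whatever baseline predictor is used) are assumptions for illustration only.

```python
def mae(predictions, truth):
    """MAE over only those test ratings the algorithm could predict."""
    errors = [abs(p - truth[key])
              for key, p in predictions.items()
              if p is not None and key in truth]
    return sum(errors) / len(errors) if errors else None  # undefined if empty


def mae_with_fallback(predictions, truth, fallback):
    """MAE over every test rating, using fallback(key) where no prediction exists."""
    errors = []
    for key, actual in truth.items():
        p = predictions.get(key)
        if p is None:
            p = fallback(key)  # hypothetical baseline predictor
        errors.append(abs(p - actual))
    return sum(errors) / len(errors) if errors else None


def coverage(predictions, truth):
    """Fraction of test ratings for which a prediction was produced."""
    made = sum(1 for key in truth if predictions.get(key) is not None)
    return made / len(truth) if truth else None


# Keys are (user, item) pairs; one prediction is missing.
truth = {("u", "a"): 4, ("u", "b"): 2, ("v", "a"): 5}
preds = {("u", "a"): 3.5, ("u", "b"): None, ("v", "a"): 4.0}
print(mae(preds, truth))
print(mae_with_fallback(preds, truth, fallback=lambda key: 3.0))
print(coverage(preds, truth))
```

An algorithm that declines every prediction makes the first function return `None`, while the second reduces to the error of the fallback predictor alone.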
To distinguish this evaluation approach from the traditional one, we refer to it as Augmented MAE, since it computes the Mean Absolute Error of a given algorithm after extending its coverage with an alternate algorithm. We also implemented the more traditional approach of omitting failed predictions from the average, which we continue to refer to as MAE. In the extreme case, if an algorithm refused to produce any predictions, its Augmented MAE would be equal to that of the Adjusted Mean User Rating algorithm, while its MAE would be undefined.

Another consideration is that we can only evaluate the accuracy for the active user on items for which the active user has provided a rating. Thus, it is possible that the item with the highest predicted rating may not be considered in the evaluation because we do not have the active user's true rating to compare against. This weakness will exist in any offline experiment where users have not rated all items.

Tuning Algorithm Parameters

One of the challenges of comparing so many different algorithms was ensuring that they were performing as well as reasonably possible. This required both correct implementation and careful tuning of parameters for each algorithm. Ensuring correct implementation can be very challenging, since a minor bug or oversight could easily disadvantage an algorithm in a slight,
