A Novel Distributed Collaborative Filtering Algorithm and Its Implementation on P2P Overlay Network*

Peng Han, Bo Xie, Fan Yang, Jiajun Wang, and Ruimin Shen

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200030, China
{phan,bxie,fyang,jjwang,rmshen}@sjtu.edu.cn

Abstract. Collaborative filtering (CF) has proved to be one of the most effective information filtering techniques. However, because their computational complexity grows quickly in both time and space as the number of records in the user database increases, traditional centralized CF algorithms suffer from poor scalability. In this paper, we first propose a novel distributed CF algorithm called PipeCF, through which both user database management and the prediction task can be done in a decentralized way. We then propose two novel approaches, significance refinement (SR) and unanimous amplification (UA), to further improve the scalability and prediction accuracy of PipeCF. Finally, we give the algorithm framework and system architecture of the implementation of PipeCF on a Peer-to-Peer (P2P) overlay network through the distributed hash table (DHT) method, which is one of the most popular and effective routing algorithms in P2P. The experimental data show that our distributed CF algorithm has much better scalability than traditional centralized ones, with comparable prediction efficiency and accuracy.

1 Introduction

Collaborative filtering (CF) has proved to be one of the most effective information filtering techniques since Goldberg et al. [1] published the first account of using it for information filtering. Unlike content-based filtering, the key idea of CF is that users will prefer those items that people with similar interests prefer, or even that dissimilar people don't prefer. The k-nearest neighbor (KNN) method is a popular realization of CF because of its simplicity and reasonable performance. Many successful applications have been built on it, such as GroupLens [4] and Ringo [5].
However, because its computational complexity grows quickly in both time and space as the number of records in the database increases, the KNN-based CF algorithm suffers from poor scalability. One way to avoid the recommendation-time computational complexity of a KNN method is to use a model-based method that uses the users' preferences to learn a model, which is then used for predictions. Breese et al. use clustering and Bayesian networks for model-based CF algorithms in [3]. Their results show that the clustering-based method is the more efficient but suffers from poor accuracy, while the Bayesian networks prove practical only for environments in which knowledge of user preferences changes slowly. Furthermore, all model-based CF algorithms still require a central database to keep all the user data, which is not always easy to achieve, for both technical and privacy reasons.

* Supported by the National Natural Science Foundation of China under Grant No. 60372078.
H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 106-115, 2004. © Springer-Verlag Berlin Heidelberg 2004

An alternative way to address the computational complexity is to implement the KNN algorithm in a distributed manner. As Peer-to-Peer (P2P) overlay networks gain more and more popularity for their advantage in scalability, some researchers have already begun to consider them as an alternative architecture [7,8,9] to the centralized CF recommender system. These methods increase the scalability of CF recommender systems dramatically. However, because they use a totally different mechanism from KNN algorithms to find appropriate neighbors, their performance is hard to analyze and may be affected by many other factors, such as network conditions and the self-organization scheme.

In this paper we solve the scalability problem of the KNN-based CF algorithm by proposing a novel distributed CF algorithm called PipeCF, which has the following advantages:

1. In PipeCF, both the user database management and the prediction computation task can be done in a decentralized way, which increases the algorithm's scalability dramatically.
2. PipeCF keeps all the other features of the traditional KNN CF algorithm, so that the system's performance can be analyzed both empirically and theoretically, and improvements to the traditional KNN algorithm can also be applied here.
3. Two novel approaches are proposed in PipeCF to further improve the prediction and scalability of the KNN CF algorithm and to reduce the calculation complexity from [...] to [...], where M is the number of users in the database and N is the number of items.
4. By designing a heuristic user database division strategy, the implementation of PipeCF on a distributed-hash-table (DHT) based P2P overlay network is quite straightforward and obtains efficient user database management and retrieval at the same time.

The rest of this paper is organized as follows. In Section 2, several related works are presented and discussed. In Section 3, we give the architecture and key features of PipeCF. Two techniques, SR and UA, are also proposed in this section. We then give the implementation of PipeCF on a DHT-based P2P overlay network in Section 4. In Section 5 the experimental results of our system are presented and analyzed. Finally, we make a brief concluding remark and give the future work in Section 6.

2 Related Works

2.1 Basic KNN-Based CF Algorithm

Generally, the task of CF is to predict the votes of active users from the user database, which consists of a set of votes $v_{i,j}$ corresponding to the vote of user $i$ on item $j$. The KNN-based CF algorithm calculates this prediction as a weighted average of other users' votes on that item through the following formula:

$$P_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} \varpi(a,i)\,(v_{i,j} - \bar{v}_i) \qquad (1)$$

where $P_{a,j}$ denotes the prediction of the vote for active user $a$ on item $j$ and $n$ is the number of users in the user database. $\bar{v}_i$ is the mean vote of user $i$:

$$\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j} \qquad (2)$$

where $I_i$ is the set of items on which user $i$ has voted. The weights $\varpi(a,i)$ reflect the similarity between the active user $a$ and each user $i$ in the user database. $\kappa$ is a normalizing factor that makes the absolute values of the weights sum to unity.

2.2 P2P System and DHT Routing Algorithm

The term "Peer-to-Peer" refers to a class of systems and applications that employ distributed resources to perform a critical function in a decentralized manner. Some of the benefits of a P2P approach include improving scalability by avoiding dependency on centralized points, and eliminating the need for costly infrastructure by enabling direct communication among clients.

As the main purpose of P2P systems is to share resources among a group of computers, called peers, in a distributed way, efficient and robust routing algorithms for locating wanted resources are critical to the performance of P2P systems. Among these algorithms, the distributed hash table (DHT) algorithm is one of the most popular and effective, and it is supported by many P2P systems such as CAN [10], Chord [11], Pastry [12], and Tapestry [13].

A DHT overlay network is composed of several DHT nodes, and each node keeps a set of resources (e.g., files, ratings of items). Each resource is associated with a key (produced, for instance, by hashing the file name), and each node in the system is responsible for storing a certain range of keys. Peers in the DHT overlay network locate a wanted resource by issuing a lookup(key) request, which returns the identity (e.g., the IP address) of the node that stores the resource with that key. The primary goals of DHT are to provide an efficient, scalable, and robust routing algorithm that reduces the number of P2P hops involved in locating a certain resource, and to reduce the amount of routing state that must be preserved at each peer.
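To make the prediction rule of Section 2.1 concrete, the following minimal Python sketch implements Equations (1) and (2). It is an illustration under stated assumptions, not the authors' code: the paper does not fix the similarity weight, so the Pearson correlation used here (as in Breese et al. [3]) is an assumed choice, and the names `mean_vote`, `pearson`, and `predict` are hypothetical.

```python
from math import sqrt


def mean_vote(votes):
    """Equation (2): mean of one user's votes; `votes` maps item -> rating."""
    return sum(votes.values()) / len(votes)


def pearson(a, b):
    """Assumed weight: Pearson correlation over the items both users rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma, mb = mean_vote(a), mean_vote(b)
    num = sum((a[j] - ma) * (b[j] - mb) for j in common)
    den = sqrt(sum((a[j] - ma) ** 2 for j in common)
               * sum((b[j] - mb) ** 2 for j in common))
    return num / den if den else 0.0


def predict(active, others, item, weight=pearson):
    """Equation (1): the active user's mean vote plus the normalized,
    weighted sum of the neighbors' deviations from their own means."""
    pairs = [(weight(active, u), u) for u in others if item in u]
    norm = sum(abs(w) for w, _ in pairs)
    if norm == 0:
        # No usable neighbor: fall back to the active user's mean vote.
        return mean_vote(active)
    kappa = 1.0 / norm  # makes the absolute weights sum to unity
    return mean_vote(active) + kappa * sum(
        w * (u[item] - mean_vote(u)) for w, u in pairs)
```

With a single neighbor the normalization cancels the weight, so the prediction is simply the active user's mean shifted by that neighbor's deviation on the target item.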
3 PipeCF: A Novel Distributed CF Algorithm

3.1 Basic PipeCF Algorithm

The first step in implementing a CF algorithm in a distributed way is to divide the original centralized user database into fractions, which can then be stored on distributed peers. For concision, we use the term bucket to denote a distributed stored fraction of the user database in the rest of this paper. Here, two critical problems should be considered. The first is how to assign each bucket a unique identifier through which it can be efficiently located. The second is which buckets should be retrieved when we need to make a prediction for a particular user.

We solve the first problem by proposing a division strategy in which each bucket holds the records of the group of users who have a particular <ITEM_ID, VOTE> tuple. This means that users in the same bucket have voted on at least one item with the same rating. This <ITEM_ID, VOTE> tuple is then used to generate a unique key that serves as the identifier of the bucket in the network, which we describe in more detail in Section 4. To solve the second problem, we propose a heuristic bucket choosing strategy that retrieves only those buckets whose identifiers are the same as those generated from the active user's ratings. Figure 1 gives the framework of PipeCF. Details of the function lookup(key) and of the implementation of PipeCF on a DHT-based P2P overlay network are described in Section 4.

The bucket choosing strategy of PipeCF is based on the assumption that people with similar interests will rate at least one item with similar votes. So when making a prediction, PipeCF uses only the records of those users who are in the same bucket as the active user. As Figure 5 in Section 5.3.1 shows, this strategy has a very high hit ratio. Moreover, this strategy requires about 50% less calculation than the traditional CF algorithm and obtains comparable predictions, as shown in Figures 6 and 7 in Section 5.
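The division and bucket choosing strategies described above can be sketched in a few lines of Python. This is a minimal single-process illustration, assuming the whole database fits in one dictionary; the function names `build_buckets` and `candidate_neighbors` are hypothetical, and the real system distributes the buckets over peers rather than holding them in memory.

```python
from collections import defaultdict


def build_buckets(user_votes):
    """Division strategy: one bucket per <ITEM_ID, VOTE> tuple.

    `user_votes` maps user id -> {item id: vote}. A user appears in one
    bucket for every rating he or she has given.
    """
    buckets = defaultdict(set)
    for user, votes in user_votes.items():
        for item, vote in votes.items():
            buckets[(item, vote)].add(user)
    return buckets


def candidate_neighbors(active_votes, buckets, active_id=None):
    """Bucket choosing strategy: fetch only the buckets whose identifiers
    are generated by the active user's own ratings, and merge them."""
    found = set()
    for item, vote in active_votes.items():
        found |= buckets.get((item, vote), set())
    found.discard(active_id)  # the active user is not his own neighbor
    return found
```

For example, a user who rated item m1 with 5 is matched only with other users who also gave m1 a 5 (or who agree with him on some other item), which is the source of the reduced calculation reported in the paper.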
Algorithm: PipeCF
Input: rating record of the active user, target item
Output: predictive rating for the target item
Method:
  For each <ITEM_ID, VOTE> tuple in the rating record of the active user:
    1) Generate the key corresponding to the <ITEM_ID, VOTE> tuple through the hash algorithm used by DHT
    2) Find the host that holds the bucket with the identifier key through the function lookup(key) provided by DHT
    3) Copy all ratings in bucket key to the current host
  Use the traditional KNN-based CF algorithm to calculate the predictive rating for the target item.

Fig. 1. Framework of PipeCF

3.2 Some Improvement

3.2.1 Significance Refinement (SR)

In the basic PipeCF algorithm, we return all users who share at least one bucket with the active user, and we find that the number of fetched users is O(N), where N is the total number of users, as Figure 7 shows. In fact, as Breese et al. showed in [3] with the notion of inverse user frequency, universally liked items are not as useful as less common items in capturing similarity. So we introduce a new concept, significance refinement (SR), which reduces the number of users returned by the basic PipeCF algorithm by limiting the number of returned users for each bucket. We term the algorithm improved by SR "Return K", which means that for every item, the PipeCF algorithm returns no more than K users for each bucket. The experimental results in Figures 7 and 8 of Section 5.3.3 show that this method reduces the number of returned users dramatically and also improves the prediction accuracy.

3.2.2 Unanimous Amplification (UA)

In our experiments with the KNN-based CF algorithm, we have found that some highly correlated neighbors have few items on which they gave the same rating as the active user. These neighbors frequently prove to have worse prediction accuracy than neighbors who share ratings with the active user but have relatively lower correlation. So we argue that we should give a special reward to the users who rated some items with the same vote by amplifying their weights, which we term Unanimous Amplification. We transform the estimated weights as follows:

$$w'_{a,i} = \begin{cases} w_{a,i} & N_{a,i} = 0 \\ w_{a,i} \cdot \alpha & 0 < N_{a,i} \le \gamma \\ w_{a,i} \cdot \beta & N_{a,i} > \gamma \end{cases} \qquad (3)$$

where $N_{a,i}$ denotes the number of items on which user $a$ and user $i$ have the same votes. Typical values in our experiments are 2.0 for $\alpha$, 4.0 for $\beta$, and 4 for $\gamma$. The experimental result in Figure 9 of Section 5.3.4 shows that the UA approach improves the prediction accuracy of the PipeCF algorithm.

4 Implementation of PipeCF on a DHT-Based P2P Overlay Network

4.1 System Architecture

Figure 2 gives the system architecture of our implementation of PipeCF on the DHT-based P2P overlay network. Here, we view the users' ratings as resources, and the system generates a unique key for each particular <ITEM_ID, VOTE> tuple through the hash algorithm, where ITEM_ID denotes the identity of the item the user votes on and VOTE is the user's rating on that item. As different users may vote on a particular item with the same rating, each key corresponds to a set of users who have the same <ITEM_ID, VOTE> tuple. As stated in Section 3, we call such a set of user records a bucket. As we can see in Figure 2, each peer in the distributed CF system is responsible for storing one or several buckets. Peers are connected through a DHT-based P2P overlay network.
Peers can find their wanted buckets efficiently by their keys through the DHT-based routing algorithm. As we can see from Figure 1 and Figure 2, the implementation of PipeCF on a DHT-based P2P overlay network is quite straightforward except for two key pieces: how to store the buckets and how to fetch them back effectively in this distributed environment. We solve these problems through two fundamental DHT functions, put(key) and lookup(key), which are described in Figure 3 and Figure 4 respectively.
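As a rough illustration of how a bucket identifier might be derived and mapped to a peer, the sketch below hashes an <ITEM_ID, VOTE> tuple to a 128-bit key and stores or fetches the bucket at the peer whose local key is closest to it. Several details are assumptions rather than facts from the paper: the hash function (SHA-256 truncated to 128 bits), the "numerically closest key" rule, and the class name `ToyDHT`. The sketch also runs in one process and skips hop-by-hop routing entirely.

```python
import hashlib


def _hash128(text):
    """Assumed hash: first 128 bits of SHA-256, as an integer."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:16], "big")


def make_key(item_id, vote):
    """Bucket identifier for one <ITEM_ID, VOTE> tuple."""
    return _hash128(f"{item_id}:{vote}")


class ToyDHT:
    """In-memory stand-in for a DHT overlay (no network, no routing hops)."""

    def __init__(self, peer_names):
        # Each peer's local key is the hash of its (unique) name.
        self.peers = {_hash128(name): {} for name in peer_names}

    def _responsible(self, key):
        # Assumed rule: the peer whose local key is numerically closest.
        return min(self.peers, key=lambda local: abs(local - key))

    def put(self, key, user_id, votes):
        """Cache one user's vote vector in the bucket identified by `key`."""
        self.peers[self._responsible(key)].setdefault(key, {})[user_id] = votes

    def lookup(self, key):
        """Return the bucket identified by `key` (empty if unknown)."""
        return self.peers[self._responsible(key)].get(key, {})
```

A peer would call `put(make_key(item, vote), user_id, votes)` once for every rating in its vote vector, and an active user would call `lookup(make_key(item, vote))` for each of its own ratings to collect candidate neighbors.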

These two functions inherit the following merits from DHT:

Scalability: it must be designed to scale to several million nodes.
Efficiency: similar users should be located reasonably quickly and with low overhead in terms of the message traffic generated.
Dynamicity: the system should be robust to frequent node arrivals and departures in order to cope with the highly transient user populations characteristic of decentralized environments.
Balanced load: in keeping with the decentralized nature, the total resource load (traffic, storage, etc.) should be roughly balanced across all the nodes in the system.

Fig. 2. System Architecture of Distributed CF Recommender System

Algorithm: DHT-based CF puts a peer P's vote vector to the DHT overlay network
Input: P's vote vector
Output: NULL
Method:
  For each <ITEM_ID, VOTE> in P's vote vector:
    1) P generates a unique 128-bit DHT key K_local (i.e., hashes the system-unique username).
    2) P hashes this <ITEM_ID, VOTE> tuple to key K, and routes it with P's vote vector to the neighbor Pi whose local key Ki_local is the most similar to K.
    3) When Pi receives the PUT message with K, it caches it. If the most similar neighbor is not itself, it just routes the message to its neighbor whose local key is most similar to K.

Fig. 3. DHT Put(key) Function

Algorithm: lookup(key)
Input: identifier key of the targeted bucket
Output: targeted bucket (retrieved from other peers)
Method:
  1) P routes the key of the targeted bucket to the neighbor Pi whose local key Ki_local is the most similar to K.
  2) When Pi receives the LOOKUP message with K, if Pi has enough cached vote vectors with the same key K, it returns the vectors back to P; otherwise it just routes the message to its neighbor whose local key is most similar to K. In either case, P will finally get all the records in the bucket whose identifier is key.

Fig. 4. DHT Lookup(key) Function

5 Experimental Results

5.1 Data Set

We use the EachMovie data set [6] to evaluate the performance of the improved algorithm. The EachMovie data set is provided by the Compaq Systems Research Center, which ran the EachMovie recommendation service for 18 months to experiment with a collaborative filtering algorithm. The information gathered during that period consists of 72,916 users, 1,628 movies, and 2,811,983 numeric ratings ranging from 0 to 5. To speed up our experiments, we use only a subset of the EachMovie data set.

5.2 Metrics and Methodology

The metric we use to evaluate accuracy is a statistical accuracy metric, which evaluates the accuracy of a predictor by comparing predicted values with user-provided values. More specifically, we use the Mean Absolute Error (MAE) to report prediction experiments, because it is the most commonly used and is easy to understand:

$$\mathrm{MAE} = \frac{\sum_{(a,j) \in T} \left| v_{a,j} - P_{a,j} \right|}{|T|} \qquad (4)$$

where $v_{a,j}$ is the rating given to item $j$ by user $a$, $P_{a,j}$ is the predicted value for user $a$ on item $j$, $T$ is the test set, and $|T|$ is the size of the test set.

We select 2000 users and, in turn, take one user as the active user and the remaining users as his candidate neighbors, because every user makes only his own recommendations locally. We use the mean prediction accuracy over all 2000 users as the system's prediction accuracy. For every user's recommendation calculation, our tests use 80% of the user's ratings for training and the remainder for testing.
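The evaluation protocol above can be sketched as follows. This is a minimal illustration, not the authors' test harness: the paper does not say how the 80% training ratings are chosen, so the seeded random split is an assumption, and the function names `split_ratings` and `mean_absolute_error` are hypothetical.

```python
import random


def split_ratings(votes, train_fraction=0.8, seed=0):
    """Split one user's ratings into training and test parts.

    `votes` maps item -> rating. The random, seeded selection is an
    assumption; the paper only states the 80%/20% proportions.
    """
    items = sorted(votes)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    train = {j: votes[j] for j in items[:cut]}
    test = {j: votes[j] for j in items[cut:]}
    return train, test


def mean_absolute_error(actual, predicted):
    """Equation (4): both arguments map (user, item) -> rating, and the
    keys of `actual` play the role of the test set T."""
    if not actual:
        raise ValueError("the test set must not be empty")
    return sum(abs(actual[k] - predicted[k]) for k in actual) / len(actual)
```

The system-level accuracy reported in the paper is then the mean of the per-user errors over all active users.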

5.3 Experimental Result

We design several experiments to evaluate our algorithm and analyze the effect of various factors (e.g., SR and UA) by comparison. All our experiments are run on a Windows 2000 based PC with an Intel Pentium 4 processor running at 1.8 GHz and 512 MB of RAM.

5.3.1 The Efficiency of Neighbor Choosing

Using a data set of 2000 users, Figure 5 shows how many of the users chosen by the PipeCF algorithm are among the top-100 users of the traditional CF algorithm. We can see from the data that when the number of users rises above 1000, more than 80 of the users most similar to the active user are chosen by the PipeCF algorithm.

Fig. 5. How Many Users Chosen by PipeCF in Traditional CF's Top 100

Fig. 6. PipeCF vs. Traditional CF

5.3.2 Performance Comparison

We compare the prediction accuracy of the traditional CF algorithm and the PipeCF algorithm while applying both top-all and top-100 user selection to them. The results are shown in Figure 6. We can see that the DHT-based algorithm has better prediction accuracy than the traditional CF algorithm.

5.3.3 The Effect of Significance Refinement

We limit the number of returned users for each bucket to 2 and to 5 and repeat the experiment of Section 5.3.2. The users for each bucket are chosen randomly. The number of users chosen and the prediction accuracy are shown in Figure 7 and Figure 8 respectively. The results show that Return All returns O(N) users and its prediction accuracy is also unsatisfactory; Return 2 returns the fewest users but has the worst prediction accuracy; Return 5 has the best prediction accuracy, and its scalability is still reasonably good (the number of returned users remains bounded by a constant as the total number of users increases).
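The "Return K" selection evaluated above can be sketched as a random sample of at most K users per bucket. This is a minimal illustration assuming the buckets are available as a dictionary keyed by (item, vote); the function name `return_k` and the injectable random generator are hypothetical conveniences.

```python
import random


def return_k(buckets, active_votes, k, rng=None):
    """Significance refinement: keep at most `k` randomly chosen users
    from each bucket matched by the active user's ratings.

    `buckets` maps (item, vote) -> set of user ids and `active_votes`
    maps item -> vote for the active user.
    """
    rng = rng or random.Random()
    chosen = set()
    for item, vote in active_votes.items():
        users = sorted(buckets.get((item, vote), ()))
        if len(users) > k:
            users = rng.sample(users, k)
        chosen.update(users)
    return chosen
```

Under this scheme the number of returned users is at most K times the number of items the active user has rated, which does not depend on the total number of users and is consistent with the constant bound observed for Return 5.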

Fig. 7. The Effect on Scalability of SR on PipeCF

Fig. 8. The Effect on Prediction Accuracy of SR on PipeCF Algorithm

Fig. 9. The Effect on Prediction Accuracy of Unanimous Amplification

5.3.4 The Effect of Unanimous Amplification

We adjust the weights for each user using Equation (3), setting α to 2.0, β to 4.0, and γ to 4, and repeat the experiment of Section 5.3.2. We use the top-100 and Return All selection methods. The result shows that the UA approach improves the prediction accuracy of both the traditional and the PipeCF algorithm. From Figure 9 we can see that when the UA approach is applied, the two kinds of algorithms have almost the same performance.

6 Conclusion

In this paper, we solve the scalability problem of the KNN-based CF algorithm by proposing a novel distributed CF algorithm called PipeCF and give its implementation on a DHT-based P2P overlay network. Two novel approaches, significance refinement (SR) and unanimous amplification (UA), are proposed to improve the performance of the distributed CF algorithm. The experimental data show that our algorithm has much better scalability than the traditional KNN-based CF algorithm, with comparable prediction efficiency.

References

1. David Goldberg, David Nichols, Brian M. Oki, Douglas Terry.: Using collaborative filtering to weave an information tapestry, Communications of the ACM, v.35 n.12, p.61-70, Dec. 1992.
2. J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl.: An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 230-237, 1999.
3. Breese, J., Heckerman, D., and Kadie, C.: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, 1998 (43-52).
4. Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, John Riedl.: GroupLens: an open architecture for collaborative filtering of netnews, Proceedings of the 1994 ACM conference on Computer supported cooperative work, p.175-186, October 22-26, 1994, Chapel Hill, North Carolina, United States.
5. Upendra Shardanand, Pattie Maes.: Social information filtering: algorithms for automating "word of mouth", Proceedings of the SIGCHI conference on Human factors in computing systems, p.210-217, May 07-11, 1995, Denver, Colorado, United States.
6. EachMovie collaborative filtering data set.: http://research.compaq.com/src/eachmovie
7. Amund Tveit.: Peer-to-peer based Recommendations for Mobile Commerce. Proceedings of the First International Mobile Commerce Workshop, ACM Press, Rome, Italy, July 2001, pp. 26-29.
8. Tomas Olsson.: "Bootstrapping and Decentralizing Recommender Systems", Licentiate Thesis 2003-006, Department of Information Technology, Uppsala University and SICS, 2003.
9. J. Canny.: Collaborative filtering with privacy. In Proceedings of the IEEE Symposium on Research in Security and Privacy, pages 45-57, Oakland, CA, May 2002. IEEE Computer Society, Technical Committee on Security and Privacy, IEEE Computer Society Press.
10. S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker.: A scalable content-addressable network. In SIGCOMM, Aug. 2001.
11. Stoica I. et al.: Chord: A scalable peer-to-peer lookup service for Internet applications (2001). In ACM SIGCOMM, San Diego, CA, USA, 2001, pp. 149-160.
12. Rowstron A., Druschel P.: Pastry: Scalable, distributed object location and routing for large scale peer-to-peer systems. In IFIP/ACM Middleware, Heidelberg, Germany, 2001.
13. Zhao B. Y. et al.: Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Tech. Rep. UCB/CSB-0-114, UC Berkeley, EECS, 2001.