An Efficient Neighbor Searching Scheme of Distributed Collaborative Filtering on P2P Overlay Network

Bo Xie, Peng Han, Fan Yang, Ruimin Shen
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200030, China
{Bxie, phan, fyang, rmshen}@sjtu.edu.cn

Abstract. Distributed Collaborative Filtering (DCF) has gained more and more attention as an alternative implementation scheme of CF-based recommender systems because of its advantages in scalability and privacy protection. However, as there is no central user database in DCF systems, the task of neighbor searching becomes much more difficult. In this paper, we first propose an efficient distributed user profile management scheme based on the distributed hash table (DHT) method, one of the most popular and effective routing algorithms in Peer-to-Peer (P2P) overlay networks. Then, we present a heuristic neighbor searching algorithm to locate potential neighbors of the active users in order to reduce network traffic and execution cost. The experimental data show that our DCF algorithm with the neighbor searching scheme has much better scalability than traditional centralized ones, with comparable prediction efficiency and accuracy.

1 Introduction

While the rapid development of network and information technology provides people with unprecedentedly abundant information resources, it also brings the problem of information overload. How to help people find the resources they are interested in has therefore attracted much attention from both researchers and vendors. Among the technologies proposed, Collaborative Filtering (CF) has proved to be one of the most effective for its simplicity in both theory and implementation. Since Goldberg et al. [1] published the first account of using it for information filtering, CF has achieved great success in both research [2, 3] and application [4, 5, 6]. The key idea of CF is that users will prefer those items that people with similar interests prefer.
Depending on the techniques used to describe and calculate the similarities between users, CF algorithms are often divided into two general classes [3]: memory-based algorithms and model-based algorithms. Memory-based algorithms directly calculate the similarities between the active user and other users, and then use the K most similar users (the K nearest neighbors) to make predictions. In contrast, model-based algorithms first construct a predictive model from the user database, and then use it to make predictions. As the computational complexity of both the calculation of similarities and the construction of the model grows quickly in time and space as the number of records in the database increases, both kinds of algorithms suffer from shortcomings in efficiency and scalability.

1 Supported by the National Natural Science Foundation of China under Grant No. 60372078.

In recent years, Distributed CF (DCF) has therefore gained more and more attention as an alternative implementation scheme of CF-based recommender systems [8, 9] because of its advantage in scalability. However, as there is no central user database in DCF systems, the task of neighbor searching becomes much more difficult. In [8], Tveit uses a routing algorithm similar to Gnutella [15] to forward neighbor query information. As it is a broadcast routing algorithm, it causes extremely heavy network traffic when the number of users is large. In [9], Olsson improves on this by exchanging information only between neighbors; however, this also reduces the efficiency of finding similar users. Moreover, as these systems use entirely different mechanisms to make predictions, their performance is hard to analyze, and existing improvements on CF algorithms cannot be reused. In this paper, we first propose an efficient distributed user profile management scheme based on the distributed hash table (DHT) method, one of the most popular and effective routing algorithms in Peer-to-Peer (P2P) overlay networks. Then, we present a heuristic neighbor searching algorithm to locate potential neighbors of the active users. The main advantages of our algorithm include:

1. Both the user database management and the prediction computation can be done in a decentralized way, which increases the algorithm's scalability dramatically.
2. The implementation of our neighbor searching algorithm on a DHT-based P2P overlay network is quite straightforward, obtaining efficient retrieval time and excellent performance at the same time.
3. It keeps all the other features of the traditional memory-based CF algorithm, so that the system's performance can be analyzed both empirically and theoretically, and improvements on traditional memory-based algorithms can also be applied here.

The rest of this paper is organized as follows. In Section 2, several related works are presented and discussed. In Section 3, we give the architecture and key features of our algorithm, and we describe its implementation on a DHT-based P2P overlay network in Section 4. In Section 5 the experimental results of our system are presented and analyzed. Finally, we make a brief concluding remark in Section 6.

2 Related Works

2.1 Memory-Based CF Algorithm

Generally, the task of CF is to predict the votes of the active user from a user database which consists of a set of votes v_{i,j} corresponding to the vote of user i on item j. The memory-based CF algorithm calculates this prediction as a weighted average of other users' votes on that item through the following formula:
$$ P_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i) \qquad (1) $$

where P_{a,j} denotes the prediction of the vote for active user a on item j, and n is the number of users in the user database. \bar{v}_i is the mean vote for user i:

$$ \bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j} \qquad (2) $$

where I_i is the set of items on which user i has voted. The weights w(a,i) reflect the similarity between the active user and each user in the user database, and \kappa is a normalizing factor which makes the absolute values of the weights sum to unity.

2.2 P2P System and DHT Routing Algorithm

The term "Peer-to-Peer" refers to a class of systems and applications that employ distributed resources to perform a critical function in a decentralized manner. With the pervasive deployment of computers, P2P is increasingly receiving attention in research, and more and more P2P systems have been deployed on the Internet. Some of the benefits of a P2P approach include: improving scalability by avoiding dependency on centralized points; eliminating the need for costly infrastructure by enabling direct communication among clients; and enabling resource aggregation. As the main purpose of P2P systems is to share resources among a group of computers called peers in a distributed way, efficient and robust routing algorithms for locating wanted resources are critical to the performance of P2P systems. Among these algorithms, the distributed hash table (DHT) algorithm is one of the most popular and effective, and it is supported by many P2P systems such as CAN [10], Chord [11], Pastry [12], and Tapestry [13]. A DHT overlay network is composed of several DHT nodes, and each node keeps a set of resources (e.g., files, ratings of items). Each resource is associated with a key (produced, for instance, by hashing the file name), and each node in the system is responsible for storing a certain range of keys.
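This key-to-node mapping can be sketched with a toy, fully local model — a minimal illustration under our own assumptions (the class and helper names are ours, and real DHTs route between peers rather than scanning a node list):

```python
import hashlib

def key_of(resource_name: str, bits: int = 32) -> int:
    """Hash a resource name into the DHT key space."""
    digest = hashlib.md5(resource_name.encode()).hexdigest()
    return int(digest, 16) % (2 ** bits)

class ToyDHT:
    """Toy model of a DHT: each key is stored on the node
    whose ID is numerically closest to it."""
    def __init__(self, node_ids):
        self.nodes = {nid: {} for nid in node_ids}

    def _closest(self, key):
        return min(self.nodes, key=lambda nid: abs(nid - key))

    def put(self, key, value):
        # Announce a resource: store it on the responsible node.
        self.nodes[self._closest(key)][key] = value

    def lookup(self, key):
        # Locate a resource: return the responsible node and the value it holds.
        owner = self._closest(key)
        return owner, self.nodes[owner].get(key)

dht = ToyDHT(node_ids=[10, 200, 3000])
k = key_of("some_file.txt")
dht.put(k, "contents")
owner, value = dht.lookup(k)
```

In a real DHT the `_closest` step is replaced by the overlay's routing tables, which is what keeps the hop count logarithmic.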
There are two basic operations in the DHT overlay network, (1) put(key, value) and (2) lookup(key), built on two layers: (1) route(key, message) and (2) key/value storage. Peers in the DHT overlay network can announce what resources they have by issuing a put(key, value) message, or locate a wanted resource by issuing a lookup(key) request, which returns the identity (e.g., the IP address) of the node that stores the resource with the given key. The primary goals of DHT are to provide an efficient, scalable, and robust routing algorithm that reduces the number of P2P hops involved in locating a certain resource, and to reduce the amount of routing state that must be kept at each peer. In Chord [11], each peer keeps track of log N other peers, where N is the total number of peers in the community. When a peer joins or leaves the overlay
network, this highly optimized version of the DHT algorithm only requires notifying log N peers about the change.

3 Our Neighbor Searching Algorithm

3.1 Distributed User Profile Management Scheme

Distributed user profile management has two key points: division and location. In our scheme, we wish to divide the original centralized user database into fractions (for concision, we will call such fractions "buckets" in the remainder of this paper) in such a manner that potential neighbors are put into the same bucket. Later, we need to access only a few buckets to fetch the useful user profiles for the active user, since retrieving all the buckets would cause an enormous amount of network traffic and is often unnecessary. We solve the first problem by proposing a division strategy which makes each bucket hold the records of the group of users who share a particular <ITEM_ID, VOTE> tuple. This means that users in the same bucket have voted on at least one common item with the same rating. Figure 1 illustrates our division strategy.

Fig. 1. User Database Division Strategy

Later, when we want to make a prediction for a particular user, we only need to retrieve those buckets which contain the active user's record. This strategy is based on the heuristic that people with similar interests will rate at least one item with similar votes. As we can see in Figure 3 of Section 5.3.1, this strategy has a very high hit ratio. Moreover, this strategy requires about 50% less computation than the traditional CF algorithm while obtaining comparable predictions, as shown in Figure 4 in Section 5. After settling on the division and bucket choosing strategies, we still need an efficient way to locate and retrieve the needed buckets from the distributed network. As mentioned in Section 2.2, DHT provides an efficient infrastructure to accomplish this task. We will discuss this in detail in Section 4.
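The division strategy above can be sketched as follows — a minimal illustration (the data layout and names are our own assumptions): every rating record is indexed under its <ITEM_ID, VOTE> tuple, so users who gave the same vote to the same item land in the same bucket.

```python
from collections import defaultdict

# Hypothetical user database: user_id -> {item_id: vote}
ratings = {
    "u1": {"item1": 5, "item2": 3},
    "u2": {"item1": 5, "item3": 4},
    "u3": {"item2": 3, "item3": 1},
}

def build_buckets(ratings):
    """Group user IDs by <ITEM_ID, VOTE> tuple: one bucket per tuple."""
    buckets = defaultdict(set)
    for user, votes in ratings.items():
        for item, vote in votes.items():
            buckets[(item, vote)].add(user)
    return buckets

def candidate_neighbors(buckets, ratings, active_user):
    """Return every user sharing at least one bucket with the active user."""
    cands = set()
    for item, vote in ratings[active_user].items():
        cands |= buckets[(item, vote)]
    cands.discard(active_user)
    return cands

buckets = build_buckets(ratings)
neighbors = candidate_neighbors(buckets, ratings, "u1")
# u2 shares the ("item1", 5) bucket; u3 shares the ("item2", 3) bucket
```

Only the buckets the active user falls into are ever touched, which is what avoids scanning the whole user database.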
3.2 Neighbor Searching in DCF

3.2.1 Return All vs. Return K

In the bucket choosing strategy described in Section 3.1, we return all users who share at least one bucket with the active user. As we can see in Figure 5 in Section 5.3.2, this strategy fetches O(N) users, where N is the total number of users. In fact, as Breese observed in [3] with the term "inverse user frequency", universally liked items are not as useful in capturing similarity as less common items. We therefore introduce a new concept, significance refinement (SR), which reduces the number of returned users by limiting how many users are returned for each bucket. We call the strategy improved by SR "Return K", meaning that for every rated item we return no more than K users per bucket, and we call the original strategy "Return All". The experimental results in Figures 5 and 6 of Section 5.3.2 show that this method reduces the number of returned users dramatically and also improves the prediction accuracy.

3.2.2 Improved Return K Strategy

Although the Return K strategy proves to increase the scalability of our DCF algorithm, it has the drawback of instability when the number of users in each bucket increases, because the K users are chosen at random. As we can see in Figure 7, when the total number of users is above 40,000, the average bucket size is more than 60, so the randomness of neighbor choosing causes much uncertainty in the prediction. There are two natural ways to address this: increase the value of K, or do some filtering at the node holding the bucket. However, the first method increases the traffic dramatically, while the second causes much more computation. In our method, we instead add a merge procedure to our neighbor choosing strategy, which we call the Improved Return K strategy.
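The basic Return K strategy can be sketched as below — an illustrative fragment under our own assumptions (the bucket layout and the use of uniform random sampling are ours; the paper only specifies that at most K users per bucket are returned, chosen randomly):

```python
import random

def return_k(buckets, ratings, active_user, k=5, rng=random):
    """For each <ITEM_ID, VOTE> bucket the active user falls in,
    fetch at most k other users, chosen at random (plain Return K)."""
    fetched = set()
    for tup in ratings[active_user].items():   # tup is an (item_id, vote) pair
        others = [u for u in buckets.get(tup, ()) if u != active_user]
        sample = others if len(others) <= k else rng.sample(others, k)
        fetched.update(sample)
    return fetched
```

The Improved Return K variant described next replaces this random sampling with a ranked merge of the user ID lists.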
In this new strategy, before we retrieve the user records, we first get the user ID list from each bucket, which causes little traffic. Then we merge identical user IDs locally and generate a list of user IDs ranked by their number of occurrences. The user IDs with the most occurrences, i.e., the users who share the most votes with the active user, appear at the top of the list. After that, we fetch only the records of the top K users in the ranked list and make predictions based on them. As Figures 5 and 6 in Section 5 show, the Improved Return K strategy achieves both better scalability and better prediction accuracy.

4 Implementation of Our Neighbor Searching Algorithm on a DHT-based P2P Overlay Network

4.1 System Architecture

Figure 2 gives the system architecture of our implementation of DCF on the DHT-based P2P overlay network. Here, we view the users' ratings as resources, and the
system generates a unique key for each particular <ITEM_ID, VOTE> tuple through the hash algorithm, where ITEM_ID denotes the identity of the item the user votes on and VOTE is the user's rating of that item. As different users may give a particular item the same rating, each key corresponds to the set of users whose rating vectors contain the <ITEM_ID, VOTE> tuple corresponding to that key. As stated in Section 3, we call such a set of user records a bucket. As we can see in Figure 2, each peer in the distributed CF system is responsible for storing one or several buckets using the distributed storage strategy described in Figure 1. Peers are connected through a DHT-based P2P overlay network, and they can find their wanted buckets by key efficiently through the DHT-based routing algorithm, which is the foundation of our implementation. As we can see from Figure 1 and Figure 2, the implementation of our PipeCF on the DHT-based P2P overlay network is quite straightforward. DHT provides a basic infrastructure for the following:

1. Defining IDs and the Vote ID to Node ID assignment
2. Defining per-node routing table contents
3. A lookup algorithm that uses the routing tables
4. A join procedure to reflect new nodes in the tables
5. Failure recovery

For our DHT-based CF scenario, IDs are 128-bit numbers. Vote IDs are chosen by the MD5 hash of the user's <ITEM_ID, VOTE> tuple, and Node IDs are chosen by the MD5 hash of the peer's <IP address, MAC address> pair. A key is stored on the node with the numerically closest ID; if node and key IDs are uniform, we get reasonable load balance. The DHT overlay network already provides the two basic operations put(key, value) and lookup(key), and for our special DHT-based CF scenario we present two new operations: getuserid(votekey) and getuservalue(votekey, userkey).
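The 128-bit ID scheme can be sketched as follows — an illustrative fragment (the string encodings of the tuples are our own assumptions; the paper only specifies MD5 over the <ITEM_ID, VOTE> and <IP address, MAC address> pairs):

```python
import hashlib

def md5_id(text: str) -> int:
    """128-bit identifier from an MD5 digest."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16)

def vote_id(item_id: str, vote: float) -> int:
    # Vote ID: MD5 hash of the <ITEM_ID, VOTE> tuple.
    return md5_id(f"{item_id}:{vote}")

def node_id(ip: str, mac: str) -> int:
    # Node ID: MD5 hash of the peer's <IP address, MAC address> pair.
    return md5_id(f"{ip}:{mac}")

def responsible_node(key: int, node_ids):
    # A key is stored on the node with the numerically closest ID.
    return min(node_ids, key=lambda nid: abs(nid - key))
```

Because MD5 spreads keys roughly uniformly over the 128-bit space, buckets end up evenly distributed across peers, which is the load-balance property the paper relies on.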
The first operation returns the IDs of the users who have the same <ITEM_ID, VOTE> tuple as the active user, and the second returns a candidate user's actual rating vector.
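Using these two operations, the Improved Return K procedure of Section 3.2.2 can be sketched as follows — an illustrative fragment in which `getuserid` and `getuservalue` are stand-ins for the DHT operations (their exact signatures beyond the paper's description, and the plain-tuple vote keys, are our assumptions):

```python
from collections import Counter

def improved_return_k(active_votes, getuserid, getuservalue, k=30, active_id=None):
    """active_votes: {item_id: vote} for the active user.
    Returns the rating vectors of the k candidates who share the most
    <ITEM_ID, VOTE> tuples with the active user."""
    occurrences = Counter()
    seen_key = {}                           # remember one vote key per candidate
    for item_id, vote in active_votes.items():
        votekey = (item_id, vote)           # stand-in for the MD5 vote key
        for user_id in getuserid(votekey):  # cheap: user IDs only, no vectors
            if user_id == active_id:
                continue
            occurrences[user_id] += 1
            seen_key[user_id] = votekey
    # Rank candidates by how many buckets they share with the active user,
    # then fetch only the top-k rating vectors over the network.
    top = [uid for uid, _ in occurrences.most_common(k)]
    return {uid: getuservalue(seen_key[uid], uid) for uid in top}
```

The expensive `getuservalue` calls are issued only for the k top-ranked users, which is why the merge step costs little extra traffic.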
[Figure 2 shows five peers (USER_3, USER_4, USER_5, USER_K, USER_T) connected through the overlay, each holding a local vote vector over Item_ID_1 ... Item_ID_n and a set of cached vote vectors fetched from other peers.]

Fig. 2. System Architecture of Distributed CF Recommender System

5 Experimental Results

5.1 Data Set

We use the EachMovie data set [7] to evaluate the performance of the improved algorithm. The EachMovie data set is provided by the Compaq Systems Research Center, which ran the EachMovie recommendation service for 18 months to experiment with a collaborative filtering algorithm. The information gathered during that period consists of 72,916 users, 1,628 movies, and 2,811,983 numeric ratings ranging from 0 to 5.

5.2 Metrics and Methodology

The metric we use for evaluating accuracy is a statistical accuracy metric, which evaluates the accuracy of a predictor by comparing predicted values with user-provided values. More specifically, we use the Mean Absolute Error (MAE) to report prediction experiments, as it is the most commonly used metric and easy to understand:

$$ \mathrm{MAE} = \frac{\sum_{(a,j) \in T} |p_{a,j} - v_{a,j}|}{|T|} \qquad (3) $$
where v_{a,j} is the rating given to item j by user a, p_{a,j} is the predicted value of user a's vote on item j, T is the test set, and |T| is the size of the test set. We select 2,000 users and choose one user as the active user each time, with the remaining users as his candidate neighbors, because every user makes his own recommendations locally. We use the ALL-BUT-ONE strategy [3] and take the mean prediction accuracy over all 2,000 users as the system's prediction accuracy.

5.3 Experimental Results

We design several experiments to evaluate our algorithm and analyze the effect of various factors by comparison. All our experiments are run on a Windows 2000 based PC with an Intel Pentium 4 processor at 1.8 GHz and 512 MB of RAM.

5.3.1 The Efficiency of Neighbor Choosing

Using a data set of 2,000 users, Figure 3 shows how many of the users chosen by the Return-All neighbor searching scheme fall among the top-100 users who have the greatest similarity with the active users, as calculated by the traditional memory-based CF algorithm. We can see from the data that when the number of users rises above 1,000, more than 80 of the users most similar to the active users are chosen by our neighbor choosing scheme.

5.3.2 Performance Comparison

We compare the prediction accuracy of the traditional CF algorithm and our Return-All algorithm; the results are shown in Figure 4. We can see that our algorithm has better prediction accuracy than the traditional CF algorithm. This result looks surprising at first sight, as traditional CF selects similar users from the whole user database while we use only a fraction of it. However, looking more deeply into our strategy, we find that we may filter out those users who have high correlations with the active users but no identical ratings. We found in [16] that such users are bad predictors, which explains our result. We then compared the performance of Return K and Return All by setting K to 5.
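The prediction and evaluation pipeline of equations (1)-(3) can be sketched as follows — an illustrative implementation using Pearson correlation for the weights w(a, i) (the paper does not fix a specific weighting; the function and variable names are ours):

```python
import math

def mean_vote(votes):
    return sum(votes.values()) / len(votes)

def pearson(a_votes, i_votes):
    """One possible w(a,i): Pearson correlation over commonly rated items,
    centered on each user's overall mean vote (equation (2))."""
    common = set(a_votes) & set(i_votes)
    if len(common) < 2:
        return 0.0
    ma, mi = mean_vote(a_votes), mean_vote(i_votes)
    num = sum((a_votes[j] - ma) * (i_votes[j] - mi) for j in common)
    da = math.sqrt(sum((a_votes[j] - ma) ** 2 for j in common))
    di = math.sqrt(sum((i_votes[j] - mi) ** 2 for j in common))
    return num / (da * di) if da and di else 0.0

def predict(a_votes, neighbors, item):
    """Equation (1): mean-centered weighted average of the neighbors' votes."""
    terms, norm = 0.0, 0.0
    for i_votes in neighbors:
        if item not in i_votes:
            continue
        w = pearson(a_votes, i_votes)
        terms += w * (i_votes[item] - mean_vote(i_votes))
        norm += abs(w)
    kappa = 1.0 / norm if norm else 0.0   # normalizing factor
    return mean_vote(a_votes) + kappa * terms

def mae(pairs):
    """Equation (3): mean absolute error over (predicted, actual) pairs."""
    return sum(abs(p - v) for p, v in pairs) / len(pairs)
```

Whether the neighbors come from the whole database (traditional CF) or from the fetched buckets (our scheme), the prediction step itself is identical, which is what makes the two directly comparable under MAE.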
From the results in Figure 6 we can see that by eliminating users who merely share ratings on popular items, the prediction accuracy can be increased further. The DHT-based bucket size distribution is illustrated in Figure 7: the more users, the larger the bucket size, and when the total number of users is above 40,000, the average bucket size is more than 60. Finally, we examine the performance of the Improved Return K strategy. We can see in Figure 6 that we obtain better performance while retrieving only 30 user records.

[Figure 3 plots, for vector similarity and Pearson similarity, how many of the users chosen by the DHT-based scheme fall in traditional CF's top 100 (y-axis: 76-92 users, predicting 20%) against the total size of the train set (500-5,000 users).]

Fig. 3. How Many Users Chosen by DHT-based CF Fall in Traditional CF's Top 100

[Figure 4 plots the Mean Absolute Error (0.845-0.875) of traditional CF versus DHT-based CF (Return All) against the total size of the train set (400-1,800 users).]

Fig. 4. DHT-based CF vs. Traditional CF
[Figure 5 plots the number of fetched DHT users (100-1,000) for Return All, Return K (K=5), and Improved Return K (K=30) against the total size of the train set (400-1,800 users).]

Fig. 5. The Effect on Scalability of Our Neighbor Searching Strategy

[Figure 6 plots the Mean Absolute Error (0.82-0.865) for Return All, Return K (K=5), and Improved Return K (K=30) against the total size of the train set (400-1,800 users).]

Fig. 6. The Effect on Prediction Accuracy of Our Neighbor Searching Strategy

[Figure 7 plots the bucket size distribution (bucket number, 0-7,000, vs. users per bucket, 0-300) for 4,000 and 40,000 EachMovie users.]

Fig. 7. The DHT Bucket Size Distribution

6 Conclusion

In this paper, we first proposed an efficient distributed user profile management scheme based on the distributed hash table (DHT) method in order to solve the scalability problem of centralized KNN-based CF algorithms. Then, we presented a heuristic neighbor searching algorithm to locate potential neighbors of the active users in order to reduce network traffic and execution cost. The experimental data show that our DCF algorithm with the neighbor searching scheme has much better scalability than traditional centralized ones, with comparable prediction efficiency and accuracy.
References

1. Goldberg, D., Nichols, D., Oki, B. M., Terry, D.: Using collaborative filtering to weave an information tapestry. Communications of the ACM, Vol. 35, No. 12, pp. 61-70, Dec. 1992.
2. Herlocker, J. L., Konstan, J. A., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 230-237, 1999.
3. Breese, J., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43-52, 1998.
4. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175-186, Chapel Hill, North Carolina, United States, October 22-26, 1994.
5. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating "word of mouth". In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210-217, Denver, Colorado, United States, May 07-11, 1995.
6. Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, Vol. 7, No. 1, pp. 76-80, Jan. 2003.
7. EachMovie collaborative filtering data set: http://research.compaq.com/src/eachmovie
8. Tveit, A.: Peer-to-peer based recommendations for mobile commerce. In Proceedings of the First International Mobile Commerce Workshop, ACM Press, Rome, Italy, July 2001, pp. 26-29.
9. Olsson, T.: Bootstrapping and decentralizing recommender systems. Licentiate Thesis 2003-006, Department of Information Technology, Uppsala University and SICS, 2003.
10. Canny, J.: Collaborative filtering with privacy. In Proceedings of the IEEE Symposium on Research in Security and Privacy, pp. 45-57, Oakland, CA, May 2002. IEEE Computer Society Press.
11. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A scalable content-addressable network. In ACM SIGCOMM, Aug. 2001.
12. Stoica, I., et al.: Chord: a scalable peer-to-peer lookup service for Internet applications. In ACM SIGCOMM, San Diego, CA, USA, 2001, pp. 149-160.
13. Rowstron, A., Druschel, P.: Pastry: scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM Middleware, Heidelberg, Germany, 2001.
14. Zhao, B. Y., et al.: Tapestry: an infrastructure for fault-tolerant wide-area location and routing. Tech. Rep. UCB/CSB-0-114, UC Berkeley, EECS, 2001.
15. Ripeanu, M.: Peer-to-peer architecture case study: Gnutella network. Technical Report, University of Chicago, 2001.
16. Han, P., Xie, B., Yang, F., Shen, R.: A scalable P2P recommender system based on distributed collaborative filtering. Expert Systems with Applications, 27(2), Elsevier, Sep. 2004 (to appear).