1072 IEICE TRANS. INF. & SYST., VOL.E88-D, NO.5 MAY 2005

LETTER

An Improved Neighbor Selection Algorithm in Collaborative Filtering

Taek-Hun KIM a), Student Member and Sung-Bong YANG b), Nonmember

SUMMARY Nowadays, customers spend much time and effort in finding the best suitable goods since more and more information is placed online. To save their time and effort in searching for the goods they want, a customized recommender system is required. In this paper we present an improved neighbor selection algorithm that exploits a graph approach. The graph approach allows us to exploit the transitivity of similarities. The algorithm searches more efficiently for a set of influential customers with respect to a given customer. We compare the proposed recommendation algorithm with other neighbor selection methods. The experimental results show that the proposed algorithm outperforms the other methods.
key words: recommender system, neighbor selection algorithm, collaborative filtering

Manuscript received June 3, 2004.
Manuscript revised November 3, 2004.
The authors are with the Dept. of Computer Science, Yonsei Univ., Korea.
a) E-mail: kimthun@cs.yonsei.ac.kr
b) E-mail: yang@cs.yonsei.ac.kr
DOI: 10.1093/ietisy/e88-d.5.1072

1. Introduction

Nowadays, customers spend much time and effort in finding the best suitable goods since more and more information is placed on-line. To save their time and effort in searching for the goods they want, a customized recommender system is required. A customized recommender system should predict the most suitable goods for customers by retrieving and analyzing their preferences.

A recommender system in general utilizes an information filtering technique called collaborative filtering, which is widely used for such recommender systems as Amazon.com and CDNow.com [1]-[4]. A recommender system using collaborative filtering, which we call CF, is based on the ratings of the customers who have similar preferences with respect to a given (test) customer. CF calculates the similarity between the test customer and every other customer who has rated the items that are already rated by the test customer. CF in general uses the Pearson correlation coefficient for calculating the similarity. A weak point of the pure CF is that it uses all other customers, including useless customers, as the neighbors of the test customer. Since CF is based on the ratings of the neighbors who have similar preferences, it is very important to select the neighbors properly to improve prediction quality.

There have been many investigations on how to select proper neighbors based on neighbor selection methods such as the nearest-neighbor method and the clustering-based method. They are quite popular techniques for recommender systems. These techniques then predict a customer's preferences for the items based on the results of the neighbors' evaluation of the same items [1], [4]-[6], [9].

With millions of customers and items, a recommender system running on an existing algorithm will suffer seriously from the scalability problem. Therefore, there are demands for a new approach that can quickly produce high-quality predictions and can resolve the very large-scale problem. Clustering techniques often lead to worse accuracy than other methods [7], [9], [12]. Once clustering is done, however, performance can be very good, since the size of the cluster that must be analyzed is much smaller. Therefore the clustering-based method can solve the very large-scale problem in recommender systems.

In this paper we present an improved recommendation algorithm using a graph approach. The proposed algorithm applies the k-means clustering method to the input dataset to find a handful of promising clusters, each of which holds a set of customers with influential similarities for the test customer. It then searches the promising clusters, as if they were graphs, in a breadth-first manner to extract certain customers, called the neighbors, who have either higher similarities or lower similarities with respect to the test customer. A cluster is considered a graph in which vertices are customers and each edge has a weight
representing the similarity between the end points (customers) of the edge. The graph approach allows us to exploit the transitivity of similarities.

To compare the performance of the clustering-based method with that of our method, the EachMovie dataset of the Compaq Computer Corporation has been used [10]. The EachMovie dataset consists of 2,811,983 preferences for 1,628 movies rated explicitly by 72,916 customers. The experimental results show that the proposed recommendation algorithm provides better prediction quality than the other methods.

The rest of this paper is organized as follows. Section 2 describes briefly CF and the clustering-based neighbor selection method. Section 3 illustrates the proposed neighbor selection algorithm in detail. In Sect. 4, the experimental results are presented. The conclusions are given in Sect. 5.

Copyright © 2005 The Institute of Electronics, Information and Communication Engineers

2. Clustering-Based Neighbor Selection in CF

2.1 Collaborative Filtering

CF recommends items through building the profiles of the customers from their preferences for each item. In CF, preferences are generally represented as numeric values which are rated by the customers. Predicting the preference for a certain item that is new to the test customer is based on the ratings of other customers for the target item. Therefore, it is very important to find a set of customers, called neighbors, with more similar preferences to the test customer for better prediction quality.

In CF, Eq. (1) is used to predict the preference of a customer. Note that in the following equation w_{a,k} is the Pearson correlation coefficient as in Eq. (2) [2]-[5], [7].

P_{a,i} = \bar{r}_a + \frac{\sum_k w_{a,k} (r_{k,i} - \bar{r}_k)}{\sum_k |w_{a,k}|}   (1)

w_{a,k} = \frac{\sum_j (r_{a,j} - \bar{r}_a)(r_{k,j} - \bar{r}_k)}{\sqrt{\sum_j (r_{a,j} - \bar{r}_a)^2} \sqrt{\sum_j (r_{k,j} - \bar{r}_k)^2}}   (2)

In the above equations P_{a,i} is the predicted preference of customer a with respect to item i. \bar{r}_a and \bar{r}_k are the averages of customer a's ratings and customer k's ratings, respectively. r_{k,i} and r_{k,j} are customer k's ratings for items i and j, respectively, and r_{a,j} is customer a's rating for item j.

If customers a and k have similar ratings for an item, w_{a,k} > 0; w_{a,k} then indicates how much customer a tends to agree with customer k on the items that both customers have already rated. In this case, customer a is a positive neighbor with respect to customer k, and vice versa. If they have opposite ratings for an item, then w_{a,k} < 0; similarly, customer a is a negative neighbor with respect to customer k, and vice versa. In this case w_{a,k} indicates how much they tend to disagree on the items that both again have already rated. Hence, if they do not correlate with each other, then w_{a,k} = 0. Note that w_{a,k} can be between -1 and 1 inclusive.

2.2 The Clustering-Based Neighbor Selection

The k-means clustering method creates k clusters, each of which consists of the customers who have similar preferences among themselves. In this method we first select k customers arbitrarily as the initial center points of the k clusters. Then each customer is assigned to a cluster in such a way that the distance between the customer and the center of the cluster is minimized. The distance is calculated using the Euclidean distance, that is, the square root of the sum of the element-wise squared differences between the customer and each center point. We then calculate the mean of each cluster based on the customers who currently belong to the cluster. The mean is now considered as the new center of the cluster. After finding the new center of each cluster, we compute the distance between the new center and each customer as before in order to find the cluster to which each customer should belong. Recalculating the means and computing the distances are repeated until a terminating condition is met. The condition is in general how far each new center has moved from the previous center; that is, if all the new centers have moved within a certain distance, we terminate the loop.

When the clustering process has terminated, we choose the cluster with the shortest Euclidean distance from its center to the test customer. Finally, the prediction for the test customer is calculated with all the customers in the chosen cluster. The clustering-based neighbor selection method can give a recommendation quickly to the customer in the case of a very large-scale dataset, because it selects only the customers in the best cluster as neighbors [9].

3. The Proposed Neighbor Selection Algorithm

We first describe the neighbor selection algorithm (NSA) in [8]. It considers both high and low similarities with respect to the test customer and exploits the transitivity of similarity using a graph approach. We then explain the computational complexity of NSA.

3.1 NSA

The key ideas of NSA are to exploit the transitivity of similarities and to consider both higher and lower similarities in selecting neighbors. We regard a portion of the input dataset as a complete undirected graph in which each vertex represents a customer and each weighted edge corresponds to the similarity between the two end points (customers) of the edge. NSA creates k clusters from the input dataset with the k-means clustering method.
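For concreteness, the k-means step just mentioned (described in Sect. 2.2) can be sketched in Python as follows. This is a minimal sketch under our own naming, not the authors' code; it assumes each customer is represented as a fixed-length vector of ratings, and it uses a fixed iteration count instead of the center-movement condition described in the text.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: returns each point's cluster index and the centers.

    points: list of equal-length rating vectors (one per customer).
    """
    rng = random.Random(seed)
    # Select k customers arbitrarily as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each customer joins the cluster whose center
        # minimizes the Euclidean distance (squared distance suffices).
        for i, p in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: sum((x - y) ** 2
                                              for x, y in zip(p, centers[c])))
        # Update step: the mean of each cluster's members becomes its new center.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers
```

The best cluster for a test customer is then the one whose returned center is nearest to the customer's rating vector.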
Then it finds the best cluster C with respect to the test customer u among the k clusters. NSA searches the customers in C to find the best vertex v with the highest similarity with respect to u. NSA then searches the unmarked vertices adjacent to v that have similarities either larger than H or smaller than L, where H and L are threshold values for the Pearson correlation coefficients. Note that as the threshold values change, so does the size of the neighbor set. The search is performed in a breadth-first manner. That is, we search the adjacent vertices of v according to H and L to find the neighbors of u, and then search the adjacent vertices of each neighbor of v in turn. The search stops when we have enough neighbors for prediction. Figure 1 depicts an example of how NSA selects the neighbors of v that have the higher similarities with respect to u when the search depth is 2. A solid line indicates that the weight on the edge is larger than H. A dotted line means that the weight of the edge is smaller than L.

Fig. 1  An example of searching the neighbors in NSA. (a) Assume that v has the highest similarity for the test customer. (b) After finding the neighbors of v (search depth = 1). (c) After selecting the neighbors of each neighbor of v (search depth = 2).

We now describe NSA in detail. In the algorithm, before we apply the graph approach to the input dataset, we want to remove insignificant customers with the k-means clustering method. After we obtain k clusters with the k-means clustering method, we choose a cluster (the best cluster) whose mean has the highest similarity with respect to the test customer u. We then apply the graph approach only to the best cluster. The following describes the overall algorithm in detail. When the algorithm terminates, the set Neighbors is returned as output.

The Neighbor Selection Algorithm
Input: the test customer u, the input dataset S
Output: Neighbors
1. Create k clusters from S with the k-means clustering method;
2. Find the best cluster C for the test customer u;
3. Find the best customer v in cluster C who has the highest similarity with respect to u;
4. Add v to Neighbors;
5. Traverse C from v in a breadth-first manner; when visiting a vertex (customer), check whether the similarity of the customer is either higher than H or lower than L, and if so, add the customer to Neighbors.

Note that Step 5 is executed until we get enough neighbors; such a terminating condition can be imposed by limiting how many levels (depths) we search from v in the breadth-first manner.

3.2 The Computational Complexity of NSA

The k-means clustering algorithm is arguably the most well-known clustering algorithm in use today. Its computational complexity is O(krn), where k is the number of desired clusters, r is the number of iterations, and n is the size of the dataset [11]. In collaborative filtering the complexity of the prediction depends on the size of the neighbor set. Therefore, if we consider all the points (customers) in the chosen cluster (the best cluster) as the neighbors, the complexity is only O(c), where c is the size of the chosen cluster.

The complexity of breadth-first search (BFS) in a complete undirected graph is O(n^2), where n is the size of the vertex set. NSA searches for the neighbors with transitivity in a graph using BFS and stops the search if the search depth is greater than a predetermined threshold value. BFS for the neighbors using the transitivity mechanism in a complete undirected graph takes O(dn), where d is the search depth. If we consider only the chosen cluster, the search complexity is O(dc). Although NSA searches for the neighbors using BFS, it searches only the not-yet-selected vertices; that is, all vertices selected as neighbors in the previous search steps are regarded as selected vertices. Therefore, the complexity T(c) of NSA can be computed with Eq. (3). T(c) is derived from T(c) = T_1 + T_2 + \cdots + T_d, where T_j is the complexity when the search depth is j; T(c) is computed step by step in the appendix. In Eq. (3) and the appendix, S_j is the number of the selected vertices when the search depth is j, and S_{j,i} is the number of the selected vertices for vertex i among the S_{j-1} vertices of the previous depth. For each depth, if there is only one selected vertex, then T(c) is O(dc), which is the worst case. If all vertices are selected when the search depth is one, then T(c) is O(c), which is the best case.

T(c) = \sum_{j=1}^{d} \Bigl( S_{j-1} \bigl( c - \sum_{m=1}^{j-1} S_m \bigr) - \sum_{i=1}^{S_{j-1}-1} (S_{j-1} - i) S_{j,i} \Bigr),   (3)

where S_0 = 1, \sum_{j=1}^{d} S_j \le c, and \sum_i S_{j,i} = S_j.

4. Experimental Results

4.1 Experiment Environment

In order to evaluate the prediction accuracy of the proposed recommendation algorithm, we used the EachMovie dataset of the Digital Equipment Corporation for the experiments [10]. The EachMovie dataset consists of 2,811,983 preferences for 1,628 movies rated explicitly by 72,916 customers. The customer preferences are represented as numeric values from 0 to 1 at an interval of 0.2, i.e., 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0. In the EachMovie dataset, one of the valuable attributes of an item is the genre of a movie. There are 10 different genres: action, animation, art/foreign, classic, comedy, drama, family, horror, romance, and thriller.

For the experiment, we retrieved 3,763 customers who had rated at least 100 movies among all the customers in the dataset. We chose randomly 50 customers as the test customers, and the rest of the customers are the training customers. For each test customer, we chose randomly 5 movies that were actually rated by the test customer as the test movies. The final experimental results are averaged over the results of the test sets.

4.2 Experimental Metrics

One of the statistical prediction accuracy metrics for evaluating recommender systems is the mean absolute error (MAE), which is the mean of the errors of the actual customer ratings against the predicted ratings in individual predictions [1], [5]-[7]. MAE is computed by Eq. (4). In the equation, N is the total number of predictions and \varepsilon_i is the error
between the predicted rating and the actual rating for item i. The lower the MAE is, the more accurate the predictions we get with respect to the numerical ratings of the customers.

MAE = \frac{1}{N} \sum_{i=1}^{N} |\varepsilon_i|   (4)

4.3 Experimental Results

We compared the proposed recommendation algorithm with two other methods. The first method is the pure collaborative filtering method (PCF) used by GroupLens [2], [3]. PCF considers all the customers in the input dataset as neighbors without clustering. The second method is a modified PCF that utilizes the k-means clustering, which we call kmCF. For the proposed algorithm we have two different settings. The first one is NSA. The other is NSA without clustering, which we call GCF.

The experimental results are shown in Table 1. We determined k = 12 for both kmCF and NSA through various experiments; that is, this value of k gave us the smallest MAE among the values tested, as shown in Fig. 2. Among the parameters in the table, L and H denote the threshold values for the Pearson correlation coefficients. For example, if L = -0.8 and H = 0.7, then we select a neighbor whose similarity is either smaller than -0.8 or larger than 0.7. In the table, d is the search depth.

Table 1  Experimental results
Methods  MAE       Parameters
PCF      0.190479
kmCF     0.17146   k = 12
GCF      0.178906  H = 0.8, d = 2
NSA      0.165700  k = 12, H = 0.7, L = -0.8, d = 2

Fig. 2  The sensitivity of the number of clusters k.

The results in the table show that NSA outperforms the other methods when the search depth is 2. We found that when the search depth is greater than 2, NSA did not provide better results, because as the size of the neighbor set gets larger, the chance that unnecessary customers are included in the neighbors gets higher. Table 1 also shows that the methods with the graph approach using the transitivity of similarities, namely GCF and NSA, find more valuable neighbors than PCF and kmCF, respectively.

5. Conclusions

It is crucial for a recommender system to have the capability of making accurate predictions by retrieving and analyzing customers' preferences. Collaborative filtering is widely used for recommender systems.
Hence various efforts to overcome its drawbacks have been made to improve prediction quality. It is important to select neighbors properly in order to make up for the weak points of collaborative filtering and to improve prediction quality. In this paper we proposed an improved neighbor selection algorithm that finds meaningful neighbors using a graph approach along with clustering. The experimental results show that the proposed algorithm provides better prediction quality than the other methods.

The proposed method utilizes a clustering technique and a transitivity mechanism. Therefore, it can be a choice for solving the very large-scale problem, because it reduces the search space using clustering for a large dataset and also produces high prediction quality using the transitivity of similarities. The complexity of the proposed method is not high; however, it may be a little higher than that of a clustering method in the worst case.

Acknowledgments

We thank the Compaq Computer Corporation for permitting us to use the EachMovie dataset. This work was supported by the Brain Korea 21 Project in 2004.

References

[1] B.M. Sarwar, G. Karypis, J.A. Konstan, and J.T. Riedl, "Application of dimensionality reduction in recommender system: A case study," Proc. ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.
[2] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, and J. Riedl, "GroupLens: Applying collaborative filtering to Usenet news," Commun. ACM, vol.40, pp.77-87, 1997.
[3] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, "GroupLens: An open architecture for collaborative filtering of netnews," Proc. ACM CSCW'94 Conference on Computer Supported Cooperative Work, pp.175-186, 1994.
[4] B.M. Sarwar, G. Karypis, J.A. Konstan, and J.T. Riedl, "Analysis of recommendation algorithms for e-commerce," Proc. ACM E-Commerce 2000 Conference, pp.158-167, 2000.
[5] J.L. Herlocker, J.A. Konstan, A. Borchers, and J. Riedl, "An algorithmic framework for performing collaborative filtering," Proc. 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.230-237, 1999.
[6] M. O'Connor and J. Herlocker, "Clustering items for collaborative filtering," Proc. ACM SIGIR Workshop on Recommender Systems, 1999.
[7] J.S. Breese, D. Heckerman, and C. Kadie, "Empirical analysis of predictive algorithms for collaborative filtering," Proc. Conference on Uncertainty in Artificial Intelligence, pp.43-52, 1998.
[8] T.-H. Kim, Y.-S. Ryu, S.-I. Park, and S.-B. Yang, "An improved recommendation algorithm in collaborative filtering," Lecture Notes
in Computer Science, vol.2455, pp.254-261, 2002.
[9] B.M. Sarwar, G. Karypis, J.A. Konstan, and J.T. Riedl, "Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering," Proc. Fifth International Conference on Computer and Information Technology, 2002.
[10] S. Glassman, "EachMovie collaborative filtering data set," Compaq Computer Corporation, http://research.compaq.com/SRC/eachmovie/, 1997.
[11] S. Kantabutra, "Cluster computing as a tool in theoretical computer science," Proc. National Workshop on Cluster Computing 2003, pp.1-16, 2003.
[12] T.-H. Kim and S.-B. Yang, "Using attributes to improve prediction quality in collaborative filtering," Lecture Notes in Computer Science, vol.3182, pp.1-10, 2004.

Appendix: The Computational Complexity T(c)

T(c) = \sum_{j=1}^{d} T_j = T_1 + T_2 + \cdots + T_d, where S_0 = 1 and \sum_i S_{j,i} = S_j.

At depth 1, only v has been selected, so T_1 = S_0 \cdot c = c.

At depth 2,
T_2 = \sum_{i=1}^{S_1} T_{2,i} = S_1 (c - S_1) - \bigl( (S_1 - 1) S_{2,1} + (S_1 - 2) S_{2,2} + \cdots + (S_1 - (S_1 - 1)) S_{2,S_1-1} \bigr),
since T_{2,1} = c - S_1, T_{2,2} = c - S_1 - S_{2,1}, \ldots, T_{2,S_1} = c - S_1 - S_{2,1} - \cdots - S_{2,S_1-1}.

At depth 3,
T_3 = \sum_{i=1}^{S_2} T_{3,i} = S_2 (c - S_1 - S_2) - \bigl( (S_2 - 1) S_{3,1} + (S_2 - 2) S_{3,2} + \cdots + (S_2 - (S_2 - 1)) S_{3,S_2-1} \bigr),
since T_{3,1} = c - S_1 - S_2, T_{3,2} = c - S_1 - S_2 - S_{3,1}, \ldots, T_{3,S_2} = c - S_1 - S_2 - S_{3,1} - \cdots - S_{3,S_2-1}.

In general,

T(c) = \sum_{j=1}^{d} T_j = \sum_{j=1}^{d} \Bigl( S_{j-1} \bigl( c - \sum_{m=1}^{j-1} S_m \bigr) - \sum_{i=1}^{S_{j-1}-1} (S_{j-1} - i) S_{j,i} \Bigr).
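To make the letter's pipeline concrete, the similarity of Eq. (2), the prediction of Eq. (1), and the breadth-first selection of Sect. 3.1 can be sketched together in Python. This is a minimal sketch, not the authors' code: ratings are stored as per-customer {item: rating} dictionaries, the best cluster is assumed to be given (e.g., by k-means), and all function and variable names are ours.

```python
import math
from collections import deque

def pearson(ra, rb):
    """Pearson correlation w_{a,k} over the co-rated items (Eq. (2))."""
    common = set(ra) & set(rb)
    if not common:
        return 0.0
    ma = sum(ra.values()) / len(ra)   # mean rating of customer a
    mb = sum(rb.values()) / len(rb)   # mean rating of customer k
    num = sum((ra[j] - ma) * (rb[j] - mb) for j in common)
    den = (math.sqrt(sum((ra[j] - ma) ** 2 for j in common)) *
           math.sqrt(sum((rb[j] - mb) ** 2 for j in common)))
    return num / den if den else 0.0

def predict(ra, neighbors, item):
    """Mean-offset weighted prediction P_{a,i} (Eq. (1))."""
    ma = sum(ra.values()) / len(ra)
    num = den = 0.0
    for rk in neighbors:
        if item in rk:
            w = pearson(ra, rk)
            mk = sum(rk.values()) / len(rk)
            num += w * (rk[item] - mk)
            den += abs(w)
    return ma + num / den if den else ma

def nsa_neighbors(cluster, ratings, u, H=0.7, L=-0.8, max_depth=2):
    """Steps 3-5 of NSA: BFS from the best customer v, keeping vertices whose
    similarity to the current vertex is above H or below L."""
    def sim(a, b):
        return pearson(ratings[a], ratings[b])
    # Step 3: the best customer v has the highest similarity to u.
    v = max((c for c in cluster if c != u), key=lambda c: sim(c, u))
    neighbors = {v}
    queue = deque([(v, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue                          # do not expand past the search depth
        for c in cluster:
            if c != u and c not in neighbors:  # search only unmarked vertices
                w = sim(node, c)
                if w > H or w < L:
                    neighbors.add(c)
                    queue.append((c, depth + 1))
    return neighbors
```

With max_depth = 2 and the thresholds of Table 1, the set returned by nsa_neighbors plays the role of Neighbors in predict; the MAE of Eq. (4) is then the average of |P_{a,i} - r_{a,i}| over the test pairs.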