filtering LETTER An Improved Neighbor Selection Algorithm in Collaborative Taek-Hun KIM a), Student Member and Sung-Bong YANG b), Nonmember

Similar documents
A Time-based Recommender System using Implicit Feedback

Comparison of Recommender System Algorithms focusing on the New-Item and User-Bias Problem

Collaborative Filtering using a Spreading Activation Approach

Image Segmentation using K-means clustering and Thresholding

Online Appendix to: Generalizing Database Forensics

Skyline Community Search in Multi-valued Networks

Shift-map Image Registration

Uninformed search methods

A Constrained Spreading Activation Approach to Collaborative Filtering

Cluster Center Initialization Method for K-means Algorithm Over Data Sets with Two Clusters

1 Surprises in high dimensions

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources

Divide-and-Conquer Algorithms

THE APPLICATION OF ARTICLE k-th SHORTEST TIME PATH ALGORITHM

Shift-map Image Registration

APPLYING GENETIC ALGORITHM IN QUERY IMPROVEMENT PROBLEM. Abdelmgeid A. Aly

Particle Swarm Optimization Based on Smoothing Approach for Solving a Class of Bi-Level Multiobjective Programming Problem

Learning Polynomial Functions. by Feature Construction

Data Mining: Clustering

Research Article Research on Law s Mask Texture Analysis System Reliability

Non-homogeneous Generalization in Privacy Preserving Data Publishing

Multilevel Linear Dimensionality Reduction using Hypergraphs for Data Analysis

Feature Extraction and Rule Classification Algorithm of Digital Mammography based on Rough Set Theory

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control

Generalized Edge Coloring for Channel Assignment in Wireless Networks

NEWTON METHOD and HP-48G

Yet Another Parallel Hypothesis Search for Inverse Entailment Hiroyuki Nishiyama and Hayato Ohwada Faculty of Sci. and Tech. Tokyo University of Scien

NEW METHOD FOR FINDING A REFERENCE POINT IN FINGERPRINT IMAGES WITH THE USE OF THE IPAN99 ALGORITHM 1. INTRODUCTION 2.

Rough Set Approach for Classification of Breast Cancer Mammogram Images

A shortest path algorithm in multimodal networks: a case study with time varying costs

d 3 d 4 d d d d d d d d d d d 1 d d d d d d

Classifying Facial Expression with Radial Basis Function Networks, using Gradient Descent and K-means

Research Article Inviscid Uniform Shear Flow past a Smooth Concave Body

Investigation into a new incremental forming process using an adjustable punch set for the manufacture of a doubly curved sheet metal

CS 106 Winter 2016 Craig S. Kaplan. Module 01 Processing Recap. Topics

Robustness and Accuracy Tradeoffs for Recommender Systems Under Attack

Design of Policy-Aware Differentially Private Algorithms

Short-term prediction of photovoltaic power based on GWPA - BP neural network model

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

New Version of Davies-Bouldin Index for Clustering Validation Based on Cylindrical Distance

Classification and clustering methods for documents. by probabilistic latent semantic indexing model

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation

The Reconstruction of Graphs. Dhananjay P. Mehendale Sir Parashurambhau College, Tilak Road, Pune , India. Abstract

Fast Window Based Stereo Matching for 3D Scene Reconstruction

Image compression predicated on recurrent iterated function systems

Privacy-Preserving Collaborative Filtering using Randomized Perturbation Techniques

Additional Divide and Conquer Algorithms. Skipping from chapter 4: Quicksort Binary Search Binary Tree Traversal Matrix Multiplication

Collaborative Filtering based on User Trends

An Efficient Neighbor Searching Scheme of Distributed Collaborative Filtering on P2P Overlay Network 1

Coupling the User Interfaces of a Multiuser Program

A New Search Algorithm for Solving Symmetric Traveling Salesman Problem Based on Gravity

BIJECTIONS FOR PLANAR MAPS WITH BOUNDARIES

Recommender System using Collaborative Filtering Methods: A Performance Evaluation

NAND flash memory is widely used as a storage

A Constrained Spreading Activation Approach to Collaborative Filtering

On the Placement of Internet Taps in Wireless Neighborhood Networks

Influence in Ratings-Based Recommender Systems: An Algorithm-Independent Approach

Comparison of Methods for Increasing the Performance of a DUA Computation

Evolutionary Optimisation Methods for Template Based Image Registration

amount of available information and the number of visitors to Web sites in recent years

Fuzzy Clustering in Parallel Universes

Problem Paper Atoms Tree. atoms.pas. atoms.cpp. atoms.c. atoms.java. Time limit per test 1 second 1 second 2 seconds. Number of tests

Slope One Predictors for Online Rating-Based Collaborative Filtering

A Cost Model For Nearest Neighbor Search. High-Dimensional Data Space

THE BAYESIAN RECEIVER OPERATING CHARACTERISTIC CURVE AN EFFECTIVE APPROACH TO EVALUATE THE IDS PERFORMANCE

A Neural Network Model Based on Graph Matching and Annealing :Application to Hand-Written Digits Recognition

Pairwise alignment using shortest path algorithms, Gunnar Klau, November 29, 2005, 11:

Loop Scheduling and Partitions for Hiding Memory Latencies

Lecture 1 September 4, 2013

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

EFFICIENT STEREO MATCHING BASED ON A NEW CONFIDENCE METRIC. Won-Hee Lee, Yumi Kim, and Jong Beom Ra

Transient analysis of wave propagation in 3D soil by using the scaled boundary finite element method

Optimal Oblivious Path Selection on the Mesh

Considering bounds for approximation of 2 M to 3 N

Solving the Sparsity Problem in Recommender Systems Using Association Retrieval

Dual Arm Robot Research Report

Random Clustering for Multiple Sampling Units to Speed Up Run-time Sample Generation

A fast embedded selection approach for color texture classification using degraded LBP

Modifying ROC Curves to Incorporate Predicted Probabilities

Learning Subproblem Complexities in Distributed Branch and Bound

Variable Independence and Resolution Paths for Quantified Boolean Formulas

Bends, Jogs, And Wiggles for Railroad Tracks and Vehicle Guide Ways

SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH

Indexing the Edges A simple and yet efficient approach to high-dimensional indexing

6.854J / J Advanced Algorithms Fall 2008

Comparing State-of-the-Art Collaborative Filtering Systems

The Journal of Systems and Software

Improving Web Search Ranking by Incorporating User Behavior Information

Collaborative Filtering: A Comparison of Graph-Based Semi-Supervised Learning Methods and Memory-Based Methods

6 Gradient Descent. 6.1 Functions

Reconstructing the Nonlinear Filter Function of LILI-128 Stream Cipher Based on Complexity

Lab work #8. Congestion control

Applying a linguistic operator for aggregating movie preferences

Generalized Edge Coloring for Channel Assignment in Wireless Networks

Performance Modelling of Necklace Hypercubes

A Revised Simplex Search Procedure for Stochastic Simulation Response Surface Optimization

Nearest Neighbor Search using Additive Binary Tree

arxiv: v1 [cs.ir] 1 Jul 2016

Exploring Context with Deep Structured models for Semantic Segmentation

Improving Performance of Sparse Matrix-Vector Multiplication

Transcription:

107 IEICE TRANS INF & SYST, VOLE88 D, NO5 MAY 005 LETTER An Improve Neighbor Selection Algorithm in Collaborative Filtering Taek-Hun KIM a), Stuent Member an Sung-Bong YANG b), Nonmember SUMMARY Nowaays, customers spen much time an effort in fining the best suitable goos since more an more information is place online To save their time an effort in searching the goos they want, a customize recommener system is require In this paper we present an improve neighbor selection algorithm that exploits a graph approach The graph approach allows us to exploit the transitivity of similarities The algorithm searches more efficiently for set of influential customers with respect to a given customer We compare the propose recommenation algorithm with other neighbor selection methos The experimental results show that the propose algorithm outperforms other methos key wors: recommener system, neighbor selection algorithm, collaborative filtering 1 Introuction Nowaays, customers spen much time an effort in fining the best suitable goos since more an more information is place on-line To save their time an effort in searching the goos they want, a customize recommener system is require A customize recommener system shoul preict the most suitable goos for customers by retrieving an analyzing their preferences A recommener system utilizes in general an information filtering technique calle collaborative filtering Collaborative filtering is wiely use for such recommener systems as Amazoncom an CD- Nowcom [1] [4] A recommener system using collaborative filtering which we call CF, is base on the ratings of the customers who have similar preferences with respect to agiven(test)customer CF calculates the similarity between the test customer an every other customer who has rate the items that are alreay rate by the test customer CF uses in general the Pearson correlation coefficient for calculating the similarity A weak point of the pure CF is that it uses all other customers incluing useless customers as the neighbors of the test customer Since CF is base on the ratings of the neighbors who have similar preferences, it is very important to select the neighbors properly to improve preiction quality There have been many investigations on how to select proper neighbors base on neighbor selection methos such as the nearest metho an the clustering-base metho They are quite popular techniques for recommener sys- Manuscript receive June 3, 004 Manuscript revise November 3, 004 The authors are with the Dept of Computer Science, Yonsei Univ, Korea a) E-mail: kimthun@csyonseiackr b) E-mail: yang@csyonseiackr DOI: 101093/ietisy/e88 5107 tems These techniques then preict customer s preferences for the items base on the results of the neighbors evaluation on the same items [1], [4] [6], [9] With millions of customers an items, a recommener system running on an existing algorithm will sufferseriously the scalability problem Therefore, there are emans for a new approach that can quickly prouce high quality preictions an can resolve the very large-scale problem Clustering techniques often lea to worse accuracy than other methos [7], [9], [1] Once clustering is one, however, performance can be very goo, since the size of a cluster that must be analyze is much smaller Therefore the clustering-base metho can solve the very large-scale problem in recommener systems In this paper we present an improve recommenation algorithm using a graph approach The propose algorithm applies the k-means clustering metho to the input ataset for fining a hanful of promising clusters each of which hols a set of customers with influential similarities for the test customer It then searches the promising clusters, as if they are graphs, in a breath-first manner to extract certain customers, calle the neighbors, whohaveeitherhigher similarities or lower similarities with respect to the test customer A cluster is consiere a graph in which vertices are customers an each ege has a weight representing the similarity between the en points(customers) of the ege The graph approach allows us to exploit the transitivity of similarities To compare the performance of the clustering-base metho with that of our metho, the EachMovie ataset of the Compaq Computer Corporation has been use [10] The EachMovie ataset consists of,811,983 preferences for 1,68 movies rate by 7,916 customers explicitly The experimental results show that the propose recommenation algorithm provies better preiction quality than other methos The rest of this paper is organize as follows Section escribes briefly CF an the clustering-base neighbor selection metho Section 3 illustrates the propose neighbor selection algorithm in etail In Sect 4, the experimental results are presente The conclusions are given in Sect 5 Clustering-Base Neighbor Selection in CF 1 Collaborative Filtering CF recommens items through builing the profiles of the Copyright c 005 The Institute of Electronics, Information an Communication Engineers

LETTER 1073 customers from their preferences for each item In CF, preferences are represente generally as numeric values which are rate by the customers Preicting the preference for a certain item that is new to the test customer is base on the ratings of other customers for the target item Therefore, it is very important to fin a set of customers, calle neighbors, with more similar preferences to the test customer for better preiction quality In CF, Eq (1) is use to preict the preference of a customer Note that in the following equation w a,k is the Pearson correlation coefficient as in Eq () [] [5], [7] {w a,k (r k,i r k )} P a,i = r a + w a,k = k w a,k k (r a, r a )(r k, r k ) (1) () (r a, r a ) (r k, r k ) In the above equations P a,i is the preference of customer a with respect to item i r a an r k are the averages of customer a s ratings an customer k s ratings, respectively r k,i an r k, are customer k s ratings for items i an, respectively, an r a, is customer a s rating for item If customers a an k have similar ratings for an item, w a,k > 0 We enote w a,k to inicate how much customer a tens to agree with customer k on the items that both customers have alreay rate In this case, customer a is a positive neighbor with respect to customer k, an vice versa If they have opposite ratings for an item, then w a,k < 0 Similarly, customer a is a negative neighbor with respect to customer k, an vice versa In this case w a,k inicates how much they ten to isagree on the item that both again have alreay rate Hence, if they on t correlate each other, then w a,k = 0 Note that w a,k can be in between -1 an 1 inclusive The Clustering-Base Neighbor Selection The k-means clustering metho creates k clusters each of which consists of the customers who have similar preferences among themselves In this metho we first select k customers arbitrarily as the initial center points of the k clusters, respectively Then each customer is assigne to a cluster in such a way that the istance between the customer an the center of a cluster is minimize The istance is calculate using the Eucliean istance, that is, a square root of the element-wise square of the ifference between the customer an each center point We then calculate the mean of each cluster base on the customers who currently belong to the cluster The mean is now consiere as the new center of the cluster After fining the new center of each cluster, we compute the istance between the new center an each customer as before in orer to fin the cluster to which the customer shoul belong Recalculating the means an computing the istances are repeate until a terminating conition is met The conition is in general how far each new center has move from the previous center; that is, if all the new centers move within a certain istance, we terminate the loop If the clustering process is terminate, we choose the cluster with the shortest Eucliean istance from its center to the test customer Finally, preiction for the test customer is calculate with all the customers in the chosen cluster The clustering-base neighbor selection metho can give a recommenation quickly to the customer in case of the very large-scale ataset, because it selects customers in the best cluster only as neighbors [9] 3 The Propose Neighbor Selection Algorithm We first escribe a neighbor selection algorithm(nsa) in [8] It consiers both high an low similarities with respect to the test customer an exploits the transitivity of similarity using a graph approach We then explain the computational complexity of the NSA 31 NSA The key ieas of NSA are to exploit the transitivity of similarities an to consier both higher an lower similarities in selecting neighbors We regar a portion of the input ataset a complete unirecte graph in which a vertex represents each customer an a weighte ege correspons to the similarity between two en points(customers) of the ege NSA creates k clusters from the input ataset with the k-means clustering metho Then it fins the best cluster C with respect to the test customer u among the k clusters NSA searches the customers in C to fin the best vertex v with the highest similarity with respect to u NSA then searches the unmarke vertices aacent to v who have the similarities either larger than H or smaller than L, where H an L are some threshol values for the Pearson correlation coefficients Note that as the threshol values change, so oes the size of the neighbors The search is performe in a breath-first manner That is, we search the aacent vertices of v accoring to H an L to fin the neighbors of u, an then search the aacent vertices of each neighbor of v in turn The search stops when we have enough neighbors for preiction Figure 1 epicts an example of how NSA selects the neighbors of v who has the higher similarities with respect to u when the search epth is A soli line inicates that the weight on the ege is larger than H A otte line means that the weight of the ege is smaller than L We now escribe NSA in etail In the algorithm, before we apply the graph approach to the input ataset, we want to remove insignificant customers with the k-means clustering metho After we obtain k clusters with the k- means clustering metho, we choose a cluster (the best

1074 IEICE TRANS INF & SYST, VOLE88 D, NO5 MAY 005 (a) Assume that v has the highest similarity for the test customer Fig 1 (b) After fining the neighbors of v (the search epth=1) (c) After selecting the neighbors of each neighbor of v (the search epth=) An example of searching the neighbors in NSA cluster) whose mean has the highest similarity with respect to the test customer u We then apply the graph approach only to the best cluster The following escribes the overall algorithm in etail When the algorithm is terminate, the set Neighbors is returne as output The Neighbor Selection Algorithm Input: the test customer u, the input ataset S Output: Neighbors 1 Create k clusters from S with the k-means clustering metho; Fin the best cluster C for the test customer u; 3 Fin the best customer v in cluster C who has the highest similarity with respect to u; 4 A v to Neighbors; 5 Traverse C from v in a breath-first manner when visiting vertices(customers) The similarity of the customer is checke to see if it is either higher than H or lower than L If so, a the customer to Neighbors; Note that Step5 is execute until we get enough neighbors an such terminating conition can be impose with how many levels(epths) we search from v in a breath-first manner 3 The Computational Complexity of the NSA The k-means clustering algorithm is arguably the most wellknown clustering algorithm in use toay Its computational complexity is O(krn), where k is the number of esire clusters, r is the number of iterations, an n is the size of the ataset [11] In collaborative filtering the complexity for the preiction epens on the size of the neighbors Therefore, if we consier all the points (customers) in the chosen cluster (the best cluster) as the neighbors, the complexity is only O(c), where c is the size of the chosen cluster The complexity of the breath-first search (BFS) in a complete unirecte graph is O(n ), where n is the size of the vertex set NSA searches the neighbors with transitivity in a graph using BFS an stops the search if the search epth is greater than a threshol value preetermine BFS for the neighbors using the transitivity mechanism in a complete unirecte graph takes O(n), where is the search epth If we consier only the chosen cluster, the search complexity is O(c) Although NSA searches the neighbors using BFS, it searches only not selecte vertices, that is, all vertices selecte as the neighbors in the previous search steps are consiere as the selecte vertices Therefore, the complexity T(c) of NSA can be compute with (3) T(c) is erive from T = T 1 +T ++T,whereT is the complexity when the search epth is T(c) is compute by each step of the following appenix In (3) an the appenix S is the number of the selecte vertices when the search epth is an S i is the number of the selecte vertices for each vertex i of S 1 For each epth if there is only one selecte vertex, then T(c) iso(c), which is the worst case If all vertices are selecte when the search epth is one, then T(c)isO(c), which is the best case 1 T(c) = (S 1 (c S i )) S 1 1 ( (S 1 i)s i ), (3) where S 0 = 1, S = c, an i S i = S 4 Experimental Results 41 Experiment Environment In orer to evaluate the preiction accuracy of the propose recommenation algorithm, we use the EachMovie ataset of the Digital Equipment Corporation for experiments [10] The EachMovie ataset consists of,811,983 preferences for 1,68 movies rate by 7,916 customers explicitly The customer preferences are represente as numeric values from 0 to 1 at an interval of 0, ie, 00, 0, 04, 06, 08, an 1 In the EachMovie ataset, one of the valuable attributes of an item is the genre of a movie There are 10 ifferent genres such as action, animation, artforeign, classic, comey, rama, family, horror, romance, an thriller For the experiment, we retrieve 3,763 customers who rate at least 100 movies among all the customers in the ataset We have chosen ranomly 50 customers as the test customers an the rest customers are the training customers For each test customer, we chose 5 movies ranomly that are actually rate by the test customer as the test movies The final experimental results are average over the results of the test sets 4 Experimental Metrics One of the statistical preiction accuracy metrics for evaluating recommener systems is the mean absolute error(mae) which is the mean of the errors of the actual customer ratings against the preicte ratings in iniviual preiction [1], [5] [7] MAE is compute by (4) In the equation, N is the total number of preictions an ε i is the error

LETTER 1075 Table 1 Methos MAE Parameters Experimental results PCF 0190479 kmcf 017146 k = 1 GCF 0178906 H = 08, = NSA 0165700 k = 1, H = 07, L = 08, = Fig The sensitivity of the number of clusters k between the preicte rating an the actual rating for item i The lower MAE is, the more accurate preiction with respect to the numerical ratings of customers we get E = N ε i i=0 N 43 Experimental Results (4) We compare the propose recommenation algorithm with two other methos The first metho is the pure collaborative filtering metho (PCF) use by the GroupLens [], [3] PCF consiers all the customers in the input ataset as neighbors without clustering The secon metho is a moifie PCF that utilizes the k-means clustering which we call kmcf For the propose algorithm we have two ifferent settings The first one is NSA The other is NSA without clustering which we call GCF The experimental results are shown in Table 1 We etermine that k=1 for both kmcf an NSA through various experiments That is, this value for k gave us the smallest MAE value among others as shown in Fig Among the parameters in the table, L an H enote that the threshol values for the Pearson correlation coefficients For example, if L = 08 anh = 07, then we select a neighbor whose similarity is either smaller than 08orlargerthan07 In the table is a search epth The results in the table show that NSA outperforms others when the search epth is We foun that when the search epth is grater than, NSA i not provie better results, because as the size of the neighbors gets larger the chances that unnecessary customers are inclue in the neighbors get higher Table 1 also shows that the methos with the graph approach using transitivity of similarities which are GCF an NSA, fin valuable neighbors in both PCF an kmcf 5 Conclusions It is crucial for a recommener system to have the capability of making accurate preiction by retrieving an analyzing customers preferences Collaborative filtering is wiely use for recommener systems Hence various efforts to overcome its rawbacks have been mae to improve preiction quality It is important to select neighbors properly in orer to make up the weak points in collaborative filtering an to improve preiction quality In this paper we propose an improve neighbor selection algorithm that fins meaningful neighbors using a graph approach along with clustering The experimental results show the propose algorithm provies the better preiction quality than other methos The propose metho utilizes a clustering technique an transitivity mechanism Therefore, it can be a choice to solve the very large-scale problem because it reuces the search space using clustering for a large ataset an also prouces high preiction quality using the transitivity of similarities The complexity of the propose metho is not high However, it may be increase a little more than that of a clustering metho in the worst case Acknowlegments We thank the Compaq Equipment Corporation for permitting us to use the EachMovie ataset This work was supporte by the Brain Korea 1 Proect in 004 References [1] BM Sarwar, G Karypis, JA Konstan, an JT Riele, Application of imensionality reuction in recommener system A case stuy, Proc ACM WebKDD 000 Web Mining for E-Commerce Workshop, pp8 90, 000 [] J Konstan, B Miller, D Maltz, J Herlocker, L Goron, an J Riel, GroupLens: Applying collaborative filtering to usenet news, Commun ACM, vol40, pp77 87, 1997 [3] P Resnick, N Iacovou, M Suchak, P Bergstrom, an J Riel, GroupLens: An open architecture for collaborative filtering of netnews, Proc ACM CSCW94 Conference on Computer Supporte Cooperative Work, pp175 186, 1994 [4] BM Sarwar, G Karypis, JA Konstan, an JT Riel, Analysis of recommenation algorithms for E-commerce, Proc ACM E- commerce 000 Conference, pp158 167, 000 [5] JL Herlocker, JA Konstan, A Borchers, an J Riel, An algorithmic framework for performing collaborative filtering, Proc n International ACM SIGIR Conference on Research an Development in Information Retrieval, pp30 37, 1999 [6] M O Connor an J Herlocker, Clustering items for collaborative filtering, Proc ACM SIGIR Workshop on Recommener Systems, 1999 [7] JS Breese, D Heckerman, an C Kaie, Empirical analysis of preictive algorithms for collaborative filtering, Proc Conference on Uncertainty in Artificial Intelligence, pp43 5, 1998 [8] T-H Kim, Y-S Ryu, S-I Park, an S-B Yang, An improve recommenation algorithm in collaborative filtering, Lecture Notes

1076 IEICE TRANS INF & SYST, VOLE88 D, NO5 MAY 005 in Computer Science, vol455, pp54 61, 00 [9] BM Sarwar, G Karypis, J A Konstan, an JT Riele, Recommener systems for large-scale e-commerce: Scalable neighborhoo formation using clustering, Proc Fifth International Conference on Computer an Information Technology, 00 [10] S Glassman, EachMovie collaborative filtering ata set, Compaq Computer Corporation, url:http://research compaqcom/src/eachmovie/ 1997 [11] S Kantabutra, Cluster computing as a tool in theoretical computer science, Proc National Workshop on Cluster Coputing 003, pp1 16, 003 [1] T-H Kim an S-B Yang, Using attributes to improve preiction quality in collaborative filtering, Lecture Notes in Computer Science, vol318, pp1 10, 004 Appenix: The computational complexity T(c) T(c) = T = T 1 + T + + T S 0 = 1, S = c, an S i = S T 1 = c S 1 T = T i = T 1 + T + + T S 1 = S 1 (c S 1 ) (S 1 1)S 1 (S 1 )S (S 1 (S 1 1))S S 1 1 i T 1 = c S 1, T = c S 1 S 1,, T S 1 = c S 1 S 1 S S S 1 1 S 1 T = T i = T 1 + T + + T S 1 = S 1 (c S 1 S S 1 ) (S 1 1)S 1 (S 1 )S (S 1 (S 1 1))S S 1 1 T 1 = c S 1 S S 1, T = c S 1 S S 1 S 1, T S 1 = c S 1 S S 1 S 1 S S S 1 1 T(c) = T = T 1 + T + + T 1 = (S 1 (c S i )) S 1 1 ( (S 1 i)s i )