Improvement of Matrix Factorization-based Recommender Systems Using Similar User Index


pp. 71-78
http://dx.doi.org/10.14257/ijseia.2015.9.3.08

Haesung Lee 1 and Joonhee Kwon 2*
1,2 Department of Computer Science, Kyonggi University, San 94-6, Yiui-dong, Yeongtong-ku, Suwon-si, Gyeonggi-do, Korea
{seastar0202, kwonjh}@kgu.ac.kr

Abstract

Matrix factorization-based approaches have proven to be efficient for recommender systems. However, due to the time complexity of composing recommendations, they are inefficient in dealing with large-scale datasets. In this paper, we present a new similar user index-based matrix factorization approach for large-scale recommender systems. Finding similar users is the most time-consuming phase in large-scale recommender systems. To reduce the time needed to find similar users, we propose a similar user index for matrix factorization. This paper describes the index structure and algorithms. Several experiments are performed. The results show that our approach handles large datasets more efficiently than a matrix factorization approach without the similar user index.

Keywords: Recommender systems, Matrix factorization, Large scale dataset, Similar user indexing

1. Introduction

Recommender systems suggest items that are likely to interest users. Generally, recommender systems are based on CF (Collaborative Filtering), a technique that automatically predicts the interest of an active user by collecting users' historical data. However, most recommender systems based on CF suffer from two limitations: data sparsity and prediction accuracy. MF (Matrix Factorization) has become one of the leading techniques for addressing data sparsity, which is the main limitation of CF. However, due to the time complexity of composing recommendations, MF-based approaches are inefficient in dealing with large-scale datasets.
In this paper, we propose a similar user index in order to reduce the computation time needed in composing recommendations. The use of the similar user index also improves prediction accuracy because the index is composed with social network analysis. The remainder of this paper is organized as follows. Section 2 surveys related works. Section 3 discusses the structure and algorithms of the similar user index-based MF. Section 4 presents the experimental study and results. Finally, Section 5 concludes this paper.

* Corresponding Author. E-mail: kwonjh@kgu.ac.kr
ISSN: 1738-9984 IJSEIA, Copyright © 2015 SERSC

2. Related Works

Collaborative filtering (CF) is the most successful recommendation technology and is used in many of the most successful recommender systems [1]. In the process of CF, forming a set of similar users is one of the most important phases. Based on the items $i_j$ experienced by similar users, CF calculates the predicted preference score of item $i_j$ for a specific user $u_i$. Then, CF produces recommendations, a list of the N items that the user $u_i$ would like the most. However, CF has two fatal limitations, which relate to data sparsity and scalability [2]. Over the last decade, many CF algorithms have been proposed to solve these two challenges. Among the proposed approaches, matrix factorization-based approaches are widely known for their effectiveness.

Matrix factorization (MF) was empirically shown to be a better model than traditional CF approaches. Let $A \in \mathbb{R}^{m \times n}$ be the rating matrix in a recommender system, where m and n are the number of users and items, respectively. The problem of matrix factorization for recommender systems is presented in equation (1) [3].

$$\min_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{n \times k}} \sum_{(i,j) \in \Omega} \left( A_{ij} - w_i^T h_j \right)^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right) \tag{1}$$

In equation (1), $\Omega$ is the set of indices for observed ratings, $\lambda$ is the regularization parameter, and $w_i^T$ and $h_j^T$ are the i-th and the j-th row vectors of the matrices W and H, respectively. The goal of equation (1) is to approximate the incomplete matrix A by $WH^T$, where W and H are rank-k matrices. In other words, MF learns the model by fitting the previously observed ratings. However, there are overfitting issues because the goal of MF is to generalize those previous ratings in a way that predicts unknown ratings. Besides, the time complexity of MF is $O(|\Omega| k^2 + (m + n) k^3)$ in each iteration. That means MF is not efficient for dealing with large-scale datasets.

3. MF-based Recommendations with the Use of Similar User Index

The size of users' historical data in recommender systems is rapidly increasing. Most MF-based recommender systems suffer from data explosion.
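To make the objective in equation (1) concrete, the following minimal Python sketch evaluates it on a toy example. The function name mf_objective, the dictionary-based rating format, and the numbers are illustrative assumptions and are not taken from the original method. The sum runs over the observed ratings only, so its cost grows with the number of ratings and the rank k.

```python
from typing import Dict, List, Tuple

# Observed entries (i, j) -> A_ij; the keys play the role of the index set Omega.
Ratings = Dict[Tuple[int, int], float]


def dot(u: List[float], v: List[float]) -> float:
    """Inner product w_i^T h_j of two rank-k factor vectors."""
    return sum(a * b for a, b in zip(u, v))


def mf_objective(ratings: Ratings, W: List[List[float]],
                 H: List[List[float]], lam: float) -> float:
    """Regularized squared error of equation (1), summed over observed ratings."""
    fit = sum((a_ij - dot(W[i], H[j])) ** 2 for (i, j), a_ij in ratings.items())
    frob = sum(x * x for row in W for x in row) + sum(x * x for row in H for x in row)
    return fit + lam * frob


# Toy data (made-up numbers): 2 users, 2 items, rank k = 1.
ratings = {(0, 0): 4.0, (1, 1): 1.0}
W = [[2.0], [1.0]]
H = [[2.0], [1.0]]
print(mf_objective(ratings, W, H, lam=0.1))  # fit term 0 plus 0.1 * 10
```

In this toy case both observed ratings are reproduced exactly, so only the regularization term contributes to the value.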
The more the historical data increases, the more severe the limitations related to data sparsity and scalability become. Besides, computing the similarity between a pair of users is known to be the most time-consuming operation in recommender algorithms [4]. A huge amount of historical data means that more and more computation time is needed to compute user similarity. That is, the large-scale dataset is a main cause of decreasing performance in recommender systems.

To overcome these limitations, we adopt an index mechanism to efficiently handle huge amounts of data. As a complement to conventional information processing mechanisms, an index mechanism is mainly used to improve the performance of systems. To reduce computation time in large-scale recommender systems, we propose a similar user index. Through the similar user index, only the data useful for composing recommendations are considered, without looking through all data. Therefore, with the use of the similar user index, the computation time of composing similar users is efficiently reduced.

Figure 1 depicts the process of the recommender system which uses the similar user index. In the back-end, the similar user index is composed with the clustering method in advance. Then the similar user index is stored in the storage. In the front-end, when the MF-based recommendation engine receives recommendation requests from client services, the engine accesses the similar user index to look for similar users. After that, based on the similar users' history data, recommended items are composed and then provided to client services.
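The front-end flow described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class SimilarUserIndex, the function recommend, and the in-memory dictionaries are hypothetical names, and the frequency-count scoring is a simple stand-in for the MF-based engine, which is not reproduced here. The point shown is that only the histories of users in the requester's cluster are touched.

```python
from collections import Counter
from typing import Dict, List, Set

History = Dict[str, Set[str]]  # user ID -> item IDs the user has experienced


class SimilarUserIndex:
    """Hypothetical index built in the back-end: cluster ID -> member user IDs."""

    def __init__(self, clusters: Dict[int, List[str]]) -> None:
        self.clusters = clusters
        # Reverse map so a user ID resolves to its cluster in one lookup.
        self.cluster_of = {u: cid for cid, members in clusters.items() for u in members}

    def similar_users(self, user_id: str) -> List[str]:
        members = self.clusters[self.cluster_of[user_id]]
        return [u for u in members if u != user_id]


def recommend(user_id: str, index: SimilarUserIndex, history: History, n: int) -> List[str]:
    """Rank unseen items by how many similar users experienced them (stand-in scoring)."""
    seen = history[user_id]
    counts = Counter(item for peer in index.similar_users(user_id)
                     for item in history[peer] if item not in seen)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]


history = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"c", "d"}, "u4": {"z"}}
index = SimilarUserIndex({0: ["u1", "u2", "u3"], 1: ["u4"]})
print(recommend("u1", index, history, n=2))  # ['c', 'd']
```

User u4 belongs to another cluster, so item z is never examined when composing recommendations for u1.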

Figure 1. The Process of the Large-scale Recommender System based on the Similar User Index

For constructing the similar user index, we propose a similar user clustering. The similar user clustering technique works by identifying groups of users who appear to have similar preferences. Since each similar user cluster includes similar users, the recommender algorithm only takes the data of users who belong to the targeted cluster, without considering users who are included in other clusters.

Among many clustering techniques, similar user clustering is extended from the k-means clustering method [5]. The major task of similar user clustering is measuring the distance between a central user and another user. If the distance between user i and the central user j of cluster $c_k$ is shorter than the distance to the central users of other clusters, user i is included in cluster $c_k$. The distance is considered as the similarity of the user and the central user. For measuring the similarity, we use the square of the Euclidean distance. Equation (2) measures the similarity $Sim_{i,j}$ between a pair of users i and j, based on the Euclidean distance.

$$Sim_{i,j} = \frac{1}{\{TN(i) - MN(i,j)\} + \{TN(j) - MN(i,j)\}} \tag{2}$$

In equation (2), TN(i) is the total number of items experienced by user i, and MN(i,j) is the number of items experienced in common by the pair of users i and j. Algorithm 1 performs the similar user clustering.

Algorithm 1. Similar User Clustering
Input: k is the number of clusters, U is the user dataset
Output: The set of clusters, C
 1. Randomly assign each user u as central user in each cluster c_k ∈ C
 2. centroid_k ← Position(u)
 3. repeat
 4.   for each user node u_i ∈ U do
 5.     for each centroid c_k do
 6.       similarity_{i,k} ← Euclidean(u_i, centroid_k)
 7.       centroid_k ← ReadjustCentroid(c_k)
 8.     if similarity_{i,k} is biggest value among other clusters then
 9.       c_k ← u_i
10. until centroids of clusters do not change
11. return C
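Equation (2) and Algorithm 1 can be sketched in Python as below. This is a minimal illustration, not the authors' implementation. Several details are assumptions: each user is represented by the set of items experienced, identical histories (zero distance) are treated as infinitely similar, ReadjustCentroid is taken to pick the member most similar to the rest of its cluster, and a seeded random generator stands in for the random choice of central users.

```python
import random
from typing import Dict, List, Set

UserItems = Dict[str, Set[str]]  # user ID -> items the user has experienced


def sim(items_i: Set[str], items_j: Set[str]) -> float:
    """Equation (2): reciprocal of the squared Euclidean distance between
    the binary item vectors of two users."""
    mn = len(items_i & items_j)                          # MN(i, j)
    dist_sq = (len(items_i) - mn) + (len(items_j) - mn)  # {TN(i)-MN} + {TN(j)-MN}
    # Assumption: identical histories (distance 0) count as maximally similar.
    return float("inf") if dist_sq == 0 else 1.0 / dist_sq


def assign(users: UserItems, centers: List[str]) -> Dict[str, List[str]]:
    """Steps 4-9: put every user into the cluster of the most similar central user."""
    clusters: Dict[str, List[str]] = {c: [] for c in centers}
    for u in sorted(users):
        if u in clusters:  # a central user stays in its own cluster
            clusters[u].append(u)
            continue
        best = max(centers, key=lambda c: sim(users[u], users[c]))
        clusters[best].append(u)
    return clusters


def readjust_centroid(users: UserItems, members: List[str]) -> str:
    """Assumed ReadjustCentroid: the member most similar to the other members."""
    def total(c: str) -> float:
        return sum(sim(users[c], users[m]) for m in members if m != c)
    return max(members, key=total)


def similar_user_clustering(users: UserItems, k: int, seed: int = 0,
                            max_iter: int = 100) -> Dict[str, List[str]]:
    rng = random.Random(seed)
    centers = rng.sample(sorted(users), k)  # step 1: random central users
    for _ in range(max_iter):
        clusters = assign(users, centers)
        new_centers = [readjust_centroid(users, clusters[c]) for c in centers]
        if new_centers == centers:          # step 10: centroids unchanged
            break
        centers = new_centers
    return assign(users, centers)


users = {"u1": {"a", "b", "c"}, "u2": {"a", "b"}, "u3": {"a", "b", "c", "d"},
         "u4": {"x", "y", "z"}, "u5": {"x", "y"}, "u6": {"w", "x", "y", "z"}}
print(similar_user_clustering(users, k=2))
```

The returned dictionary maps each central user to the members of its cluster, which is the information the similar user index needs. The max_iter guard is an added safety limit and is not part of Algorithm 1.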

In Algorithm 1, centroid_k is the position of the central user in cluster c_k. The function Position returns the location in the cluster. Euclidean measures the distance between a pair of users of each cluster. ReadjustCentroid readjusts the center position in each cluster based on the recently computed similarity. Among the clusters, the cluster with the shortest distance between the user u_i and the central user centroid_k is chosen to include the user u_i. The clustering algorithm is iterated until the center position of each cluster does not change.

Through the similar user clustering, similar user clusters are composed. Then, the similar users of each cluster are indexed. Each similar user cluster has its own cluster ID, and we use these cluster IDs as identifiers in the index. For fast access to the similar user index, we define a hash function. Figure 2 depicts the structure of the similar user index. When a user ID is given, its similar users are retrieved through the hash function. The retrieved set of similar users is a similar user cluster. As shown in Figure 2, the similar user index consists of a cluster ID dictionary and lists of user IDs. The dictionary of cluster IDs records each cluster ID with the value of its centroid. The hash function computes the similarities between the given user ID and the centroid of each similar user cluster, and then returns the cluster ID with which the given user has the greatest similarity.

Figure 2. The Structure of the Similar User Index

Ultimately, the recommender system simply accesses the similar user index with a given user ID, and the set of similar users is retrieved without needing to consider the whole user data. Then, the recommender system composes recommendations based on the retrieved similar users. As a result, the similar user index helps the recommender system recommend items quickly and accurately, without considering the whole user data or spending computation time on composing similar users.

4. Evaluation

In this section, we compare the performance of our approach with that of previous MF-based recommender systems without the similar user index. In the experiment, our approach uses the similar user index in MF-based recommender systems. In this paper, our approach is called MF+SimIndex and the previous MF-based recommender system is called MF. We evaluated both recommendation accuracy and recommendation time. First, we evaluated recommendation accuracy under varying data sparsity. Second, we measured the time of composing recommended items with various data sizes.

The experimental environment is a 2.4 GHz dual-core Intel Core i5 with 8 GB of 1600 MHz DDR3 RAM. In the comparative experiments, the recommender systems are implemented with Ubuntu 13.1, Hadoop 2.2 and Apache Mahout 0.8. For handling the large-scale dataset in the distributed environment, we use Hadoop, the best-known open-source distributed computing framework [6]. Apache Mahout is a machine learning library that provides useful libraries for recommender systems [7]. For an efficient evaluation, we implemented the MF-based recommender algorithms with the libraries provided by Apache Mahout. The experimental data sets are composed from the data of MovieLens, an online movie recommendation site [8].

Table 1. Experimental Data Sets for the Evaluation of the Recommendation Accuracy

Name   The number of ratings   Sparsity rate (%)
A1     1,000,000               95.531
A2     900,000                 85.977
A3     810,000                 77.380
A4     729,000                 69.642
A5     656,100                 62.677
A6     590,490                 56.410
A7     531,441                 50.769
A8     478,296                 45.692
A9     430,467                 41.122
A10    387,420                 37.010

In the first part of the experiment, we measure the RMSE of each of the two recommender systems to evaluate recommendation accuracy. RMSE (root-mean-square error) is the most widely used measure for evaluating the accuracy of recommendations. A high RMSE value indicates low prediction accuracy. We synthetically generate 10 data sets that vary in data sparsity. Table 1 lists the 10 generated data sets for evaluating the accuracy of recommendations. The data sparsity rate expresses the degree of data sparsity. If the data sparsity rate is low, much of the data is missing. Missing data is a main cause of reduced recommendation accuracy.

Figure 3 shows the experimental results of evaluating recommendation accuracy with the 10 data sets. The results make clear that the recommender system with the similar user index performs better than the MF-based recommender system without it. As a result, the use of the similar user index improves the accuracy of recommendations regardless of data sparsity. This advantage of the similar user index is mainly attributed to the similar user clustering, because similar user clustering is not affected by missing data.
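For reference, the RMSE used in this first experiment can be computed as in the short Python sketch below. The function name rmse and the sample numbers are illustrative assumptions; the paper's own measurements were obtained with Apache Mahout and are not reproduced here.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root-mean-square error between predicted and actual ratings."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


# Made-up ratings: one exact prediction and one that is off by 2.
print(rmse([3.0, 4.0], [3.0, 2.0]))  # about 1.414
```

A lower value means that the predicted ratings are closer to the held-out ratings.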

Figure 3. The Performance about the Recommendation Accuracy

In the second part of the experiment, we measure the response time of the two recommender systems. The response time of a recommender system is the computation time needed to compose recommendations for a specific user. If the response time is short, the recommender system is considered to perform well. For the experiment on response time, we also synthetically generate 10 data sets with various data sizes. Table 2 shows those 10 data sets.

Table 2. Experimental Data Sets for Evaluating the Response Time of Recommendations

Name   The number of users   Data size (MB)
B1     3706                  1.040
B2     3595                  0.910
B3     3433                  0.848
B4     3227                  0.784
B5     2952                  0.725
B6     2657                  0.666
B7     2351                  0.610
B8     2069                  0.578
B9     2041                  0.539
B10    1731                  0.493

In the experiment on response time, we measure the time at which the value of RMSE converges within a certain range. The range is 0.68 RMSE 0. 6. Figure 4 compares the experimental results of the two recommender systems. We can see that the recommender system with the similar user index is also superior to the recommender system in which only general MF is applied. In other words, the use of the similar user index improves the response time of recommendation on all experimental data sets. Besides, the recommender system with the similar user index is relatively unaffected by data size. For the recommender system which does not use the similar user index, however, the response time becomes slower as the data size increases.

Figure 4. The Performance about the Response Time

Therefore, we know that the use of the similar user index improves the performance of recommender systems regardless of data size. The improvement in response time is mainly attributed to the use of the similar user index. With the given user ID, the recommender system simply accesses the similar user index and receives the set of similar users, without needing to consider the whole user data or to compute the similarity between users.

5. Conclusion

Data sparsity and prediction quality have been recognized as crucial challenges that most recommender systems confront. Matrix factorization (MF) has become one of the leading techniques for solving the issue of data sparsity. However, due to the time complexity of composing recommendations, MF-based approaches are inefficient in dealing with a huge amount of historical data. In this paper, we propose a similar user index. The similar user index removes useless data from the calculation for composing recommendations. With the given user ID, the recommender system simply accesses the similar user index and receives the set of similar users, without needing to consider the whole user data or to compute the similarity between users. The experimental results show that the use of the similar user index improves the accuracy of recommendations without being affected by data sparsity. The similar user index also helps the recommender system compose recommendations faster regardless of increasing data sizes. In conclusion, the proposed approach improves the performance of MF-based recommender systems, and the similar user index can be usefully applied in large-scale recommender systems that handle a huge amount of data.

Acknowledgments

This work was supported by the Gyonggi Regional Research Center (GRRC) of Korea and the Contents Convergence Software (CCS) research center of Korea.

References

[1] P. Resnick, GroupLens: an open architecture for collaborative filtering of netnews, Proceedings of the 1994 ACM conference on Computer supported cooperative work, ACM, (1994), pp. 175-186.
[2] B. Sarwar, Item-based collaborative filtering recommendation algorithms, Proceedings of the 10th international conference on World Wide Web, ACM, (2001), pp. 285-295.
[3] Y. Koren, R. Bell and C. Volinsky, Matrix factorization techniques for recommender systems, Computer, vol. 42, no. 8, (2009), pp. 30-37.
[4] F. Fouss, A. Pirotte and M. Saerens, A novel way of computing similarities between nodes of a graph, with application to collaborative recommendation, Web Intelligence, Proceedings of the 2005 IEEE/WIC/ACM International Conference, (2005), pp. 550-556.
[5] C. Bouras and V. Tsogkas, Clustering user preferences using W-kmeans, 2011 Seventh International Conference on Signal-Image Technology and Internet-Based Systems (SITIS), (2011), pp. 75-82.
[6] Hadoop, http://hadoop.apache.org/.
[7] Apache Mahout, https://mahout.apache.org/.
[8] GroupLens, http://grouplens.org/datasets/.

Authors

Lee Haesung is a Ph.D. candidate in Computer Science at Kyonggi University, Korea. Her research areas include Context-aware Computing, Social Network, Information Retrieval and Mobile Computing. She works on software development in the area of data search in ubiquitous environments. She received her B.S. and M.S. in Computer Science from Kyonggi University, Korea. Contact her at seastar0202@kyonggi.ac.kr.

Kwon Joonhee is a professor of Computer Science at Kyonggi University, Korea. She was a visiting research professor at the Computer Science Department at New Jersey Institute of Technology. Her research areas include Context-aware Computing, Information Retrieval, Social Network, Web 2.0 and Mobile Database. Her research projects focus on data search using social networks and Web 2.0 in ubiquitous environments. She received her B.S., M.S. and Ph.D. in Computer Science from Sookmyung Women's University, Korea. Contact her at kwonjh@kyonggi.ac.kr.