Hybrid Tag Recommendation for Social Annotation Systems

Jonathan Gemmell, Thomas Schimoler, Bamshad Mobasher, Robin Burke
Center for Web Intelligence, School of Computing, DePaul University, Chicago, Illinois, USA
{jgemmell, tschimo1, mobasher, rburke}@cdm.depaul.edu

ABSTRACT

Social annotation systems allow users to annotate resources with personalized tags and to navigate large and complex information spaces without the need to rely on predefined hierarchies. These systems help users organize and share their own resources, as well as discover new ones annotated by other users. Tag recommenders in such systems assist users in finding appropriate tags for resources and help consolidate annotations across all users and resources. But the size and complexity of the data, as well as the inherent noise and inconsistencies in the underlying tag vocabularies, have made the design of effective tag recommenders a challenge. Recent efforts have demonstrated the advantages of integrative models that leverage all three dimensions of a social annotation system: users, resources and tags. Among these approaches are recommendation models based on matrix factorization. But these models tend to lack scalability and often hide the underlying characteristics, or information channels, of the data that affect recommendation effectiveness. In this paper we propose a weighted hybrid tag recommender that blends multiple recommendation components drawing separately on complementary dimensions, and evaluate it on six large real-world datasets. In addition, we attempt to quantify the strength of the information channels in these datasets and use these results to explain the performance of the hybrid.
We find our approach is not only competitive with the state-of-the-art techniques in terms of accuracy, but also has the added benefits of being scalable to large real-world applications, extensible to incorporate a wide range of recommendation techniques, easily updateable, and more scrutable than other leading methods.

Categories and Subject Descriptors
H.2 [Database Management]: H.2.8 Database application: Data mining; H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval: Search process

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'10, October 26-30, 2010, Toronto, Ontario, Canada. Copyright 2010 ACM 978-1-4503-0099-5/10/10...$10.00.

General Terms
Algorithms, Experimentation, Performance

Keywords
Social Annotation, Information Channels, Hybrid Recommenders, Recommender Systems

1. INTRODUCTION

In social annotation systems, information access functions such as search, navigation and resource sharing are supported by annotations: arbitrary tags applied to resources by individual users. Delicious¹ supports users as they bookmark URLs. Citeulike² enables researchers to manage scholarly references. Bibsonomy³ allows users to annotate both. Social annotation systems are quickly becoming ubiquitous in a variety of domains. For example, Amazon⁴ and others have incorporated social annotations into their web sites. The popularity of social annotation systems is driven in part by the low entry barrier and the freedom to annotate resources with any tag. The aggregated connections between users, resources and tags provide a rich information space for users to explore.
However, the benefits of social annotation systems do not come without a cost. The size, noise and dimensionality of the data make navigation and information access difficult. Recommender systems are therefore a critical component of these applications. In this work we focus on tag recommenders, which assist users during the annotation process by recommending tags for a selected resource.

Recent efforts in tag recommendation have shown that integrative models that leverage all three dimensions of a social annotation system (users, resources, tags) produce superior results. Graph-based models [7] and a variety of latent variable techniques [13, 14, 18, 19] have been investigated. These approaches tend to be computationally intensive and scale poorly. Previous work [3, 4] has shown that hybrid tag recommenders, which combine several components each exploiting different dimensions of the data, offer competitive results while maintaining the simplicity, computational efficiency and explanatory capacity of the component recommenders. However, these results have focused on hybrid models specifically designed for a particular dataset or as a means to augment another integrative technique.

¹ delicious.com
² citeulike.org
³ bibsonomy.org
⁴ amazon.com

This paper proposes a framework for constructing linear weighted hybrids that combine component recommenders into a single integrated model. No individual component is required to cover all dimensions of the data, but when taken together the components complement one another. The hybrid is therefore able to produce results superior to what the components produce alone.

To help understand our experimental results, we explore the notion of an information channel: the power one dimension possesses in predicting or modeling another dimension of the social annotation system. To quantify the strength of these information channels, we develop a family of metrics based on conditional entropy. These metrics reveal marked differences in the characteristics of the datasets, which are reflected in the performance of the recommendation components.

The rest of the paper is organized as follows. In Section 2 we present related work. Section 3 introduces tag recommendation. Results of the recommendation techniques are provided in Section 4. In Section 5 we introduce the notion of information channels and use them to evaluate the results of the tag recommenders. Finally, we conclude the paper with a summary of our findings.

2. RELATED WORK

One of the first techniques to demonstrate the value of an integrative approach for tag recommendation in social annotation systems was a graph-based variant [7] of the well-known PageRank algorithm. The computational burden of computing the PageRank values of each user, resource and tag for every recommendation makes the algorithm ill-suited for large-scale deployment.

Tensor factorization is another integrative technique for making tag recommendations. Tucker decomposition is one such example; it factors the three-dimensional tagging data into three feature spaces and a core residual tensor [18, 19]. Unlike the graph-based model, online computation of recommendations is highly efficient.
However, the offline computation required to build the model is not scalable to the demands of real-world applications.

A pair-wise interaction tensor factorization model has also been proposed, which offers far more reasonable run times in both the construction of the model and the generation of recommendations [13, 14]. It optimizes the ranking of tags given user-resource pairs in the training data. Tags may then be recommended for a new user-resource pair. This approach represents the current state-of-the-art in tag recommendation, providing both a high degree of accuracy and computational efficiency.

Our previous work in tag recommendation has demonstrated the benefits of hybrid recommenders [3, 4]. One approach demonstrated that the graph-based models may be improved by incorporating item-based collaborative filtering. Another effort resulted in a hybrid recommender for Bibsonomy in the context of the PKDD-ECML 2009 challenge [10]. In this paper we extend those efforts, proposing a more general framework for constructing linear weighted hybrid tag recommenders. The hybrid is constructed from component recommenders and produces results competitive with state-of-the-art techniques.

3. TAG RECOMMENDATION

This section begins with a discussion of the data models for a social annotation system. We then present our proposed framework for the linear weighted hybrid tag recommender and discuss the individual components that may be incorporated into the hybrid. For comparative purposes we also describe the state-of-the-art pair-wise interaction tensor factorization algorithm.

3.1 Data Model

The foundation of a social annotation system is the annotation: the record of a user labeling a resource with one or more tags. A collection of annotations results in a complex network of interrelated users, resources and tags [11].
A social annotation system can be described as a four-tuple $\langle U, R, T, A \rangle$, where $U$ is a set of users; $R$ is a set of resources; $T$ is a set of tags; and $A$ is a set of annotations. An annotation contains a user, a resource and all tags the user applied to the resource. A social annotation system can also be viewed as a three-dimensional matrix, $URT$, in which an entry $URT(u,r,t)$ is 1 if $u$ tagged $r$ with $t$, and 0 otherwise.

Aggregate projections of the data can be constructed, reducing the dimensionality but sacrificing information [12]. For example, the relation between resources and tags can be defined as $RT(r,t)$. In this work, we calculate $RT(r,t)$ as the number of users that have applied $t$ to $r$. This notion strongly resembles the bag-of-words vector space model [15] and is analogous to the idea of term frequency common in information retrieval. A similar two-dimensional projection can be constructed for $UT$, in which an entry contains the number of times a user has applied a tag to any resource. Finally, $UR$ is a binary matrix indicating whether or not a user has annotated a resource. An alternative approach would be to define an entry in the matrix as the number of tags a user has applied to a resource. Our previous work and continued experimentation have shown that the binary model for $UR$ produces better results.

Each resource, $r$, may be modeled as a vector over the multi-dimensional space of tags, where a weight, $w(t_i)$, in dimension $i$ corresponds to the importance of a particular tag, $t_i$:

$$\vec{r}_t = \langle w(t_1), w(t_2), \ldots, w(t_{|T|}) \rangle \qquad (1)$$

Similarly, a resource can be modeled as a vector over the space of users, where each weight, $w(u_i)$, corresponds to the importance of a particular user, $u_i$, to produce $\vec{r}_u$. Analogous vector models can be constructed for users ($\vec{u}_r$, $\vec{u}_t$) and tags ($\vec{t}_u$, $\vec{t}_r$). We draw the weights directly from the previously constructed aggregate projections $UR$, $UT$ and $RT$. The model of a user, resource or tag is defined as a row or column taken from one of the projections.
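As a minimal sketch of the projections described above, the following Python builds RT, UT and UR from a list of (user, resource, tag) triples. The dictionary-based representation, the function names and the toy triples are illustrative assumptions, not part of the original system.

```python
from collections import defaultdict


def build_projections(triples):
    """Build the RT, UT and UR projections from (user, resource, tag) triples."""
    triples = set(triples)  # URT is binary, so duplicate triples carry no weight
    rt = defaultdict(int)   # RT[(r, t)]: number of users who applied t to r
    ut = defaultdict(int)   # UT[(u, t)]: number of resources u tagged with t
    ur = {}                 # UR[(u, r)]: 1 if u annotated r at all (binary model)
    for u, r, t in triples:
        rt[(r, t)] += 1
        ut[(u, t)] += 1
        ur[(u, r)] = 1
    return dict(rt), dict(ut), ur


def resource_tag_vector(rt, resource):
    """Model a resource as a vector over tags: one row of RT (Equation 1)."""
    return {t: w for (r, t), w in rt.items() if r == resource}


# Hypothetical toy data, for illustration only.
triples = [
    ("alice", "paper1", "ml"),
    ("bob", "paper1", "ml"),
    ("bob", "paper1", "tagging"),
    ("alice", "paper2", "ml"),
]
rt, ut, ur = build_projections(triples)
```

Sparse dictionaries are used here because most (resource, tag) and (user, tag) pairs never occur; a sparse matrix library would serve the same purpose at scale.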
3.2 Linear Weighted Hybrid

Our proposed framework aggregates the results of several component recommenders in linear combination [1]. The components are freed from the burden of covering all the available dimensions of the data and instead specialize in only a few. A successful hybrid creates a synergistic blend of its component parts, producing results superior to what they could achieve alone.

We can view each component of a tag recommendation system as a function $\psi : U \times R \times T \to \mathbb{R}$, which, given a user $u \in U$ and a resource $r \in R$, produces a real-valued result $p$ as the predicted relevance of a tag $t$ for that particular user-resource pair: $\psi(u,r,t) = p$.

In the most common settings tag recommenders are used to produce a ranked list of suggested tags for a particular user and a specific resource. To do so using the above formulation, for a given user $u$ and resource $r$, we iterate over all tags, sort them by their corresponding relevance scores, and return the top $n$ tags:

$$rec(u,r) = \mathrm{TOP}^{\,n}_{t \in T}\; \psi(u,r,t) \qquad (2)$$

In our proposed hybrid framework the relevance score for a tag is calculated using several component tag recommenders. These scores are then combined in a linear model. Specifically, given a set of component tag recommenders $C$, a linear weighted hybrid tag recommender will accept a user $u$ and resource $r$. It will then query each of its component recommenders, $c \in C$, for a tag, $t$, and combine the results in the linear model:

$$\psi_h(u,r,t) = \sum_{c \in C} \alpha_c \, \psi_c(u,r,t) \qquad (3)$$

where $\psi_h(u,r,t)$ is the linear weighted relevance score of the tag and $\alpha_c$ is the weight given to the component $c$.

It should be noted that the scores from one component may be drawn from a different distribution than the other components. In order to ensure that the relevance scores for all component recommenders are on the same scale, we normalize the scores so that each $\psi_c(u,r,t)$ falls in the interval [0,1].

As additional recommenders are added to the hybrid, its complexity grows. The challenge then becomes how to ascertain the correct $\alpha$ for each component in order to maximize the effectiveness of the hybrid. We use a hill climbing technique because of its speed and simplicity. The $\alpha$ vector is initialized with random positive numbers constrained such that the sum of the vector equals 1. The vector is then randomly modified and tested against a holdout set to ascertain if it achieves better results.
The holdout set may be evaluated for recall, precision or F-measure. In this work we rely on the F-measure since it incorporates both recall and precision. If the result is improved, the change is accepted; otherwise it is usually rejected. Occasionally a change to the $\alpha$ vector is accepted even when it does not improve the results in order to more fully explore the $\alpha$ space. Modifications continue until the vector stabilizes. To reduce the risk of settling on a local maximum, the experiment is repeated 20 times from different random starting points.

With this integrative model any tag recommender can be incorporated into the hybrid. We focus on relatively simple component recommenders due to their speed and scrutability. We now present those components.

3.2.1 Popularity Models

Perhaps the simplest recommendation strategy is merely to recommend the most commonly used tags in the system. Alternatively, given a user-resource pair a recommender may ignore the user and recommend the most popular tags for that particular resource. This strategy is strictly resource dependent and does not take into account the tagging habits of the user. We define $\psi(u,r,t)$ for the resource-based popularity recommender, pop_r, as:

$$\psi(u,r,t) = \sum_{v \in U} \theta(v,r,t) \qquad (4)$$

We define $\theta(v,r,t)$ as 1 if $v$ tagged $r$ with $t$ and 0 otherwise.

In a similar fashion a recommender may ignore the resource and recommend the most popular tags for that particular user. While such an algorithm would include tags frequently applied by the user, it does not consider the resource information and may recommend tags irrelevant to the current resource. We define $\psi(u,r,t)$ for the user-based popularity recommender, pop_u, as:

$$\psi(u,r,t) = \sum_{s \in R} \theta(u,s,t) \qquad (5)$$

While popularity models are not necessarily the most effective techniques, they do serve as a baseline and may benefit the hybrid. Popularity-based recommenders require little online computation.
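The linear combination of Equations 2 and 3 and the hill-climbing search for the $\alpha$ vector can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' implementation: components are modeled as hypothetical callables returning a {tag: score} dictionary, scores are scaled into [0,1] by dividing by the maximum (the text does not specify the normalization), and the search runs for a fixed number of steps instead of testing for stabilization. The step size and the probability of accepting a worse vector are likewise made-up parameters.

```python
import random


def normalize(scores):
    """Scale non-negative {tag: score} values into [0, 1] (max scaling is an assumption)."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {t: 0.0 for t in scores}
    return {t: s / top for t, s in scores.items()}


def hybrid_recommend(components, alphas, user, resource, n):
    """Equations 2 and 3: blend normalized component scores, return the top n tags."""
    combined = {}
    for component, alpha in zip(components, alphas):
        for tag, score in normalize(component(user, resource)).items():
            combined[tag] = combined.get(tag, 0.0) + alpha * score
    return sorted(combined, key=lambda t: (-combined[t], t))[:n]


def f_measure(recommended, held_out):
    """Harmonic mean of precision and recall for one recommendation set."""
    hits = len(set(recommended) & set(held_out))
    if hits == 0:
        return 0.0
    precision = hits / len(set(recommended))
    recall = hits / len(set(held_out))
    return 2 * precision * recall / (precision + recall)


def evaluate(components, alphas, holdout, n):
    """Mean F-measure over holdout entries of the form (user, resource, tags)."""
    scores = [f_measure(hybrid_recommend(components, alphas, u, r, n), tags)
              for u, r, tags in holdout]
    return sum(scores) / len(scores)


def renormalize(alphas):
    """Rescale the weights so that they sum to 1."""
    total = sum(alphas)
    return [a / total for a in alphas]


def hill_climb(components, holdout, n=5, steps=200, restarts=20,
               step_size=0.1, accept_worse=0.05, seed=0):
    """Search for component weights; returns (best_alphas, best_score)."""
    rng = random.Random(seed)
    best_alphas, best_score = None, -1.0
    for _ in range(restarts):
        # Random positive starting point constrained to sum to 1.
        alphas = renormalize([rng.random() + 1e-9 for _ in components])
        score = evaluate(components, alphas, holdout, n)
        if score > best_score:
            best_alphas, best_score = alphas, score
        for _ in range(steps):
            candidate = renormalize(
                [max(1e-9, a + rng.uniform(-step_size, step_size)) for a in alphas])
            cand_score = evaluate(components, candidate, holdout, n)
            # Accept improvements; occasionally accept a worse vector to explore.
            if cand_score > score or rng.random() < accept_worse:
                alphas, score = candidate, cand_score
                if score > best_score:
                    best_alphas, best_score = alphas, score
    return best_alphas, best_score


# Hypothetical toy components, for illustration only.
def comp_a(user, resource):
    return {"x": 3.0, "y": 1.0}


def comp_b(user, resource):
    return {"y": 2.0, "z": 1.0}
```

Because each component is just a function of the user and resource, adding a new recommender to the hybrid only requires appending it to the list and re-running the weight search.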
They are easily built offline and can be incrementally updated.

3.2.2 User-Based Collaborative Filtering

User-based collaborative filtering [5, 9, 17] works under the assumption that users who have agreed in the past are likely to agree in the future. A neighborhood, $N_r$, of the $k$ most similar users to $u$ is identified through a similarity metric such that all the neighbors have tagged $r$. For any given resource the weighted sum can then be calculated as:

$$\psi(u,r,t) = \sum_{v \in N_r} \sigma(u,v) \, \theta(v,r,t) \qquad (6)$$

where $\sigma(u,v)$ is the similarity between the users $u$ and $v$. In this work we rely on cosine similarity of the user models. As before, $\theta(v,r,t)$ is 1 if $v$ has annotated $r$ with $t$ and 0 otherwise. When users are modeled as resources we call this approach KNN_ur. When users are modeled as tags we call this technique KNN_ut.

Since the algorithm will only populate the neighborhood with users that have annotated $r$, the number of similarities to calculate can be quite small. The popularity of resources in social annotation systems follows the power law and the great majority of resources will benefit from this reduced computation, while a few will require additional computational effort. As a result the algorithm scales well with large datasets. Similarities may even be computed offline.

This approach relies on the collaboration of other users. It may be the case that an appropriate tag cannot be recommended because it does not appear in a neighbor's profile. While the personalization offered by user-based filtering is an important benefit, it lacks the ability to reflect the habits and patterns of the larger crowd.

3.2.3 Item-Based Collaborative Filtering

Item-based collaborative filtering [2, 16] relies on discovering similarities among resources rather than among users. We may model the resources as a vector over the user space.
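The popularity recommenders of Equations 4 and 5 and the user-based neighborhood score of Equation 6 can be sketched as follows. The sketch assumes the data is available as a collection of (user, resource, tag) triples and shows the tag-based user model (KNN_ut); the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def pop_resource(triples, resource):
    """Equation 4: for each tag, count the users who applied it to the resource."""
    scores = defaultdict(int)
    for v, r, t in set(triples):
        if r == resource:
            scores[t] += 1
    return dict(scores)


def pop_user(triples, user):
    """Equation 5: for each tag, count the resources the user applied it to."""
    scores = defaultdict(int)
    for u, s, t in set(triples):
        if u == user:
            scores[t] += 1
    return dict(scores)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0) for k, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def user_tag_profiles(triples):
    """Model each user as a vector over tags (the rows of UT)."""
    profiles = defaultdict(lambda: defaultdict(int))
    for u, r, t in set(triples):
        profiles[u][t] += 1
    return profiles


def knn_user(triples, profiles, user, resource, k=30):
    """Equation 6: similarity-weighted votes from the k nearest users who tagged the resource."""
    triples = set(triples)
    # Only users who have annotated the query resource are neighbor candidates.
    candidates = {v for v, r, _ in triples if r == resource and v != user}
    sims = {v: cosine(profiles[user], profiles[v]) for v in candidates}
    neighbors = set(sorted(sims, key=lambda v: (-sims[v], v))[:k])
    scores = defaultdict(float)
    for v, r, t in triples:
        if r == resource and v in neighbors:
            scores[t] += sims[v]
    return dict(scores)


# Hypothetical toy data, for illustration only.
triples = [
    ("alice", "p1", "ml"), ("bob", "p1", "ml"), ("bob", "p1", "tagging"),
    ("alice", "p2", "ml"), ("carol", "p2", "ml"), ("carol", "p3", "ml"),
]
scores = knn_user(triples, user_tag_profiles(triples), "carol", "p1")
```

Swapping `user_tag_profiles` for a profile builder over resources would give the KNN_ur variant without changing `knn_user`.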

We call this model KNN_ru. When relying on tags, the vector contains the frequency with which a resource has been annotated with the tags. We call this model KNN_rt. We define $N_u$ as the $k$ nearest resources to $r$ drawn from the user profile, $u$, and then define the relevance score of a tag for a user-resource pair as:

$$\psi(u,r,t) = \sum_{s \in N_u} \sigma(r,s) \, \theta(u,s,t) \qquad (7)$$

If a user has annotated resources similar to $r$ with $t$ then $\psi(u,r,t)$ will be high. Otherwise the relevance score will be correspondingly low. The strength of this approach is that it can draw the most relevant tags from the user profile. Its weakness is that it cannot recommend tags from outside the user profile.

Similarity metrics need only be calculated with resources in the user profile. If the user profile is not exceptionally large, this computation can be quickly done in real time. Otherwise, similarities can be calculated offline.

3.3 Pair-wise Interaction Tensor Factorization

For the sake of comparison, we have chosen a tag recommender based on pair-wise interaction tensor factorization [14], which formed the basis for the winning submission of the PKDD 2009 Tag Recommendation Challenge [10]. This model-based approach generates a set of factor matrices which resembles a special case of the Tucker decomposition of a tensor. The tensor itself is not directly induced by the data (this could be achieved by regarding each $(u,r,t)$ triple as a binary cell of a tensor), but rather reflects a ranking over the tags for each user-resource pair.

The model is built by first considering observations in the data of the form $(u, r, t^+, t^-)$, where $(u, r, t^+)$ is a triple which is found in the data (a positive example of tag selection) and $(u, r, t^-)$ is a triple not found in the data (a negative example of tag selection). An iterative gradient-descent algorithm is employed to optimize a ranking function (based on Bayesian conditionals) that prefers positive examples in the data over negative ones.
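The item-based score of Equation 7 can be sketched in the same style. The example below shows the tag-based resource model (KNN_rt) over (user, resource, tag) triples; the function names and toy data are assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0) for k, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def resource_tag_profiles(triples):
    """Model each resource as a vector over tags (the rows of RT)."""
    profiles = defaultdict(lambda: defaultdict(int))
    for u, r, t in set(triples):
        profiles[r][t] += 1
    return profiles


def knn_item(triples, profiles, user, resource, k=30):
    """Equation 7: similarity-weighted votes from the k nearest resources in the user's profile."""
    triples = set(triples)
    # Neighbor candidates are restricted to resources the user has already annotated.
    user_resources = {s for u, s, _ in triples if u == user and s != resource}
    sims = {s: cosine(profiles[resource], profiles[s]) for s in user_resources}
    neighbors = set(sorted(sims, key=lambda s: (-sims[s], s))[:k])
    scores = defaultdict(float)
    for u, s, t in triples:
        if u == user and s in neighbors:
            scores[t] += sims[s]
    return dict(scores)


# Hypothetical toy data, for illustration only.
triples = [
    ("bob", "p1", "ml"), ("carol", "p1", "ml"), ("bob", "p1", "tagging"),
    ("alice", "p2", "ml"), ("bob", "p2", "ml"), ("alice", "p3", "music"),
]
scores = knn_item(triples, resource_tag_profiles(triples), "alice", "p1")
```

In this toy example the tag alice applied to the similar resource p2 outscores the tag she applied to the unrelated resource p3, which mirrors the way the technique ignores irrelevant parts of the user profile.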
Each of four related matrices is updated until convergence. The matrices represent the factor-reduced components of the specialized tensor factorization $M = U_k T_k^U + R_k T_k^R$, where $U_k$ is the user factor matrix, $R_k$ is the resource factor matrix, $T_k^U$ is the tag factor matrix with respect to users, $T_k^R$ is the tag factor matrix with respect to resources, $k$ is the selected number of factors, and $M$ is the personalized tag-ranking tensor.

Generating a tag recommendation for a given user $u$ and resource $r$ is simply a matter of referring to the appropriate user-resource column of the ranking tensor $M$. The relevance score of a tag given a user-resource pair is calculated as:

$$\psi(u,r,t) = \sum_{i=1}^{k} U_k[u][i] \cdot T_k^U[t][i] + R_k[r][i] \cdot T_k^R[t][i] \qquad (8)$$

4. EXPERIMENTAL EVALUATION

In this section we describe the methods used to gather and pre-process our six datasets. Following an outline of our methodology, we examine the results of our proposed linear weighted hybrid tag recommender along with its components and the pair-wise interaction tensor factorization model. Finally we draw some general conclusions.

4.1 Datasets

Our experiments are conducted using data from six large real-world social annotation systems. On all datasets we generate p-cores [8]. Users, resources and tags are removed from the dataset in order to produce a residual dataset that guarantees each user, resource and tag occurs in at least $p$ annotations. We define an annotation to include a user, a resource, and every tag the user has applied to the resource. For the larger datasets we use 20-cores. In the smaller datasets 5-cores are used.

Several reasons exist to construct p-cores. By eliminating infrequent items, the size of the data is dramatically reduced, allowing the application of recommendation techniques that would otherwise be computationally impractical. By removing rarely occurring users, resources or tags, noise in the data can be dramatically reduced.
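The p-core construction described above can be sketched as an iterative pruning loop. This is a minimal illustration, assuming annotations are given as (user, resource, set of tags) tuples; dropping an annotation once all of its tags have been removed is an assumption of this sketch, since the text does not say how such cases are handled.

```python
from collections import Counter


def p_core(annotations, p):
    """Prune until every user, resource and tag occurs in at least p annotations."""
    current = [(u, r, set(tags)) for u, r, tags in annotations]
    while True:
        users, resources, tags = Counter(), Counter(), Counter()
        for u, r, ts in current:
            users[u] += 1
            resources[r] += 1
            tags.update(ts)
        kept = []
        for u, r, ts in current:
            if users[u] < p or resources[r] < p:
                continue  # drop annotations of infrequent users or resources
            frequent = {t for t in ts if tags[t] >= p}
            if frequent:  # assumption: an annotation left without tags is dropped
                kept.append((u, r, frequent))
        if kept == current:
            return kept  # nothing changed in this pass, so the core is stable
        current = kept


# Hypothetical toy data, for illustration only.
annotations = [
    ("u1", "r1", {"a", "b"}),
    ("u1", "r2", {"a"}),
    ("u2", "r1", {"a"}),
    ("u2", "r2", {"a", "c"}),
    ("u3", "r3", {"a"}),
]
core = p_core(annotations, 2)
```

The loop has to repeat because removing one infrequent item can push another item below the threshold; it terminates because each pass can only remove data.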
Because of their scarcity, these are the very items likely to confound recommenders. Recommendation in the so-called long tail is a valid area of exploration, but it lies outside the scope of this paper.

Bibsonomy enables its users to annotate both URL bookmarks and journal articles. The dataset was gathered on 1 January 2009, encompassing the entire system. This dataset has been made available online by the system administrators [6]. They have pre-processed the data to remove anomalies. A 5-core was taken. It contains 13,909 annotations with 357 users, 1,738 resources and 1,573 tags.

Citeulike is a popular online tool used by researchers specifically to manage and catalog journal articles. The site owners make their dataset freely available to download. We use a snapshot taken on 17 February 2009. Once a 5-core was computed, the remaining dataset contains 2,051 users, 5,376 resources, 3,343 tags and 105,873 annotations.

MovieLens is a dataset gathered from the corresponding MovieLens Web site and is administered by the GroupLens research lab at the University of Minnesota. It contains users, ratings of movies, and tags. A 5-core was generated from the data, resulting in 35,366 annotations with 819 users, 2,445 resources and 2,309 tags.

Delicious is a popular Web site in which users annotate URLs. On 19 October 2008, 198 of the most popular tags were taken from the user interface and the site was recursively explored. From 20 October to 15 December, the complete profiles of 524,790 users were collected. Due to memory and time constraints, 10% of the user profiles were randomly selected, and a 20-core taken for experiments. The dataset is our largest, containing 7,665 users, 15,612 resources and 5,746 tags. It contains 720,788 annotations.

Amazon is one of the world's largest retailers.
The site includes a myriad of ways for users to express and discover opinions of the products: ratings, editorial reviews, customer reviews, product details, and customer purchasing habits. Recently, Amazon has added social tagging to this list. Beginning on 1 July 2009 we recursively explored the site to gather 1.5 million user profiles. Many users had extremely small profiles or used idiosyncratic tags. After taking a 20-core of the data it contained 498,217 annotations with 8,802 users, 10,679 resources and 5,559 tags.

LastFM users upload their music profile, create playlists and share their musical tastes online. We selected 100 random users from the system and recursively explored the friend network. Only about 20% of the users had annotated a resource. Users have the option to tag songs, artists or albums. The tagging data here is limited to album annotations. Experimentation on artist and song data reveals similar trends. A p-core of 20 was drawn from the data. It contains 2,368 users, 2,350 resources, 1,141 tags and 172,177 annotations.

4.2 Methodology

Each user's annotations were divided equally among five folds. Four folds were used as training data to build the recommenders. The fifth was used to train the model parameters and ascertain the optimal weights of the components in the hybrids. The results of the fifth fold were then discarded and we performed four-fold cross validation on the remaining folds. The results were averaged over each user, then over the final four folds.

The recommenders are evaluated on their ability to recommend tags given a user-resource pair. The user and resource for each annotation were submitted to the recommenders and the recommenders returned a set of tags, $T_r$. These were then evaluated against the tags in the holdout annotation, $T_h$.

Recall is a common metric for evaluating the utility of recommendation algorithms. It measures the percentage of items in the holdout set that appear in the recommendation set. Recall is a measure of completeness and is defined as:

$$recall = \frac{|T_h \cap T_r|}{|T_h|} \qquad (9)$$

Precision is another common metric for measuring the usefulness of recommendation algorithms. It measures the percentage of items in the recommendation set that appear in the holdout set. Precision measures the exactness of the recommendation algorithm and is defined as:

$$precision = \frac{|T_h \cap T_r|}{|T_r|} \qquad (10)$$

The recall and precision will vary depending on the size of the recommendation set. In the following experiments we present the metrics with recommendation sets of size one through ten.

4.3 Experimental Results

In this section we offer some general observations about the experimental results reported in Figure 1. We then examine each dataset individually before offering a summary of our conclusions.

After tuning the variables we chose a $k$ of 30 for all collaborative filtering techniques.
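The recall and precision of Equations 9 and 10 from Section 4.2, evaluated for recommendation sets of increasing size, can be sketched as follows. The function names and the toy ranking are hypothetical; returning 0 for an empty set is a convention adopted here, not something the text specifies.

```python
def recall(recommended, held_out):
    """Equation 9: share of held-out tags that appear in the recommendation set."""
    held_out = set(held_out)
    if not held_out:
        return 0.0  # convention assumed for an empty holdout set
    return len(held_out & set(recommended)) / len(held_out)


def precision(recommended, held_out):
    """Equation 10: share of recommended tags that appear in the holdout set."""
    recommended = set(recommended)
    if not recommended:
        return 0.0  # convention assumed for an empty recommendation set
    return len(set(held_out) & recommended) / len(recommended)


def curve(ranked_tags, held_out, max_n=10):
    """Return (n, recall, precision) for recommendation sets of size 1..max_n."""
    return [(n, recall(ranked_tags[:n], held_out), precision(ranked_tags[:n], held_out))
            for n in range(1, max_n + 1)]


# Hypothetical ranked output of a recommender and the held-out tags.
ranked = ["ml", "web", "tagging", "python"]
held_out = {"ml", "tagging"}
points = curve(ranked, held_out, max_n=3)
```

Each tuple in `points` corresponds to one point on a recall-precision plot of the kind summarized in Figure 1.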
The trend was for the recall and precision to steadily increase as k was increased and then suffer from diminishing returns. PITF, the pair-wise interaction tensor factorization model, was built with 64 features and a learning rate of 0.03 [14]. It was trained until convergence. We did experiments with 10 to 100 features. The results exhibited a sharp increase and then leveled out as the number of features approached 50. The hybrid reported in Figure 1 is composed of the two popularity based recommenders and four collaborative filtering recommenders. We have purposely constructed the hybrid with simple recommenders in order to permit insights into the datasets that might otherwise be obscured. By observing the importance of a component to the hybrid, we may infer the importance of the dimensions covered by that component. The composition of the hybrids is reported in Table 1. The hybrids do not draw upon PITF. A motivation of this paper is to demonstrate that hybrid recommenders can integrate multiple dimensions of the data by exploiting simple components. If PITF had been included in the hybrid it would not be clear if the success of the hybrid was owed to PITF or the ability of the hybrid to produce a synergistic blend of its constituent parts. Instead, we report the PITF results because it represents the state-of-the-art tag recommender and therefore offers an important point of comparison. While not evaluated in this paper, experimentation has revealed that incorporating PITF into the hybrid produces a small improvement over both PITF and the linear weighted hybrid. In all six datasets the hybrid outperforms its constituent parts, revealing that a linear weighted hybrid can exploit multiple dimensions of the data through its components. These components are not individually required to cover all dimensions of the data, and may instead focus on a particular dimension such as the relationship between tags and resources. 
When aggregated into a single framework, the components provide complementary information while maintaining their simplicity, speed and insights into the data.

The hybrid is competitive with PITF, often surpassing it. In MovieLens PITF proves marginally better. In Bibsonomy, Citeulike and LastFM the results are very similar. In Delicious and Amazon the hybrid is clearly superior.

The difference between Delicious and Amazon versus the other datasets is the diversity of the user profiles. Citeulike and Bibsonomy users focus on their area of expertise. MovieLens and LastFM users gravitate toward particular genres of movies and music. In Delicious, however, the users are able to tag web pages from across the entire Internet. Consequently, the user profiles often contain numerous unrelated topics. Similarly, Amazon users do not restrict their annotations to particular categories. The user profiles reflect the diversity one might expect of a consumer visiting the world's largest online retailer.

These diverse user profiles are difficult to characterize with a feature space model, the foundation of PITF. When recommending tags, PITF cannot draw upon particular features while ignoring others. PITF may recommend a tag not relevant to the particular context. In contrast, user-based and item-based collaborative filtering are able to focus on the most relevant parts of the user profile. User-based collaborative filtering only recommends tags applied to the query resource, narrowing the focus of the recommendation regardless of the diversity in the user profile. Item-based collaborative filtering techniques construct a neighborhood of resources from the user profile most similar to the query resource, effectively ignoring parts of the user profile that are not relevant to the immediate recommendation task. Our proposed linear weighted hybrid inherits the capacity to focus on specific aspects of the user profile.

The hybrid offers additional benefits.
When constructed from simple yet fast components, the hybrid itself maintains these properties offering a highly scalable and easily updatable solution for tag recommendation. It is possible to explain the results from the component recommenders and consequently the hybrid itself. In contrast PITF is a black box with little explanatory capacity.

Figure 1: Recall (x-axis) and precision (y-axis) plotted for recommendation sets of size one through ten on the six datasets.

Table 1: Contribution of the individual components in the hybrids for each of the six datasets.

Dataset     pop_u   pop_r   KNN_ur  KNN_ut  KNN_ru  KNN_rt
Bibsonomy   0.007   0.010   0.101   0.423   0.064   0.396
Citeulike   0.017   0.034   0.066   0.265   0.109   0.509
MovieLens   0.028   0.023   0.063   0.407   0.048   0.431
Delicious   0.006   0.008   0.065   0.326   0.120   0.476
Amazon      0.021   0.039   0.122   0.435   0.172   0.212
LastFM      0.017   0.032   0.011   0.471   0.430   0.038

The hybrid also offers extensibility. In this work we focused on recommenders that draw primarily on the URT data model. Other recommenders could be incorporated into the hybrid based on recency, context or content. A recommender based on recency might favor tags recently added to the user profile over tags that have not been used lately. A recommender might interpret context in a myriad of ways: recent queries, recently visited resources, or even routinely visited user profiles. Content-based recommenders might propose author names, movie genres or product information. Other recommendation techniques are possible and can be easily included as a component in the proposed framework. Other state-of-the-art techniques are often unable to accommodate this information.

Generally, the hybrid draws little strength from the two popularity-based algorithms in favor of the collaborative filtering methods, as shown in Table 1. KNN_ut appears to be universally important across all six datasets, accounting for as much as 47% of the hybrid. In most cases KNN_rt is also extremely important.

We now turn our attention to the six datasets and discuss each individually with respect to the performance of the individual components and the performance of the integrative models.

4.3.1 Bibsonomy

The performance of the tag recommenders is presented in Figure 1. In Bibsonomy both pop_u and KNN_ru perform poorly. These techniques recommend tags drawn from the user profile. KNN_rt also recommends tags from the user profile but performs far better; it relies on tags rather than users to model the resource. The methods that rely on the resource-tag information (pop_r, KNN_ur, KNN_ut, KNN_rt) are tightly grouped and are among the top performing components. This analysis suggests that for the purpose of tag recommendation in Bibsonomy, the interaction between the resource and tag dimensions is dominant over the interaction between the user and tag dimensions.

The integrative techniques offer a large improvement over the individual recommenders. The hybrid and PITF produce nearly identical results, marking the hybrid as a viable alternative. The hybrid relies most strongly on KNN_ut and KNN_rt, as shown in Table 1.
The user-based and item-based collaborative filtering methods appear to complement one another. This improvement may be explained through an analysis of the application. Bibsonomy is designed for researchers to share and organize scholarly references and web sites. When annotating journal articles, users often focus on their area of expertise and use domain-driven tags. In this case, KNN_rt may be particularly relevant, as reflected in its performance in Citeulike, an application which focuses entirely on publications. When tagging web pages, users may exhibit broader interests and employ more user-specific tags. For the purpose of tagging web pages, KNN_ut demonstrates efficacy, as it does in Delicious, a site devoted to web pages. In an application that permits the tagging of both types of resources, the hybrid can achieve maximum effectiveness by combining these two complementary components.

4.3.2 Citeulike

In Citeulike we observe a social annotation system that is entirely focused on scholarly publications. Its users are often interested in a narrow field and employ tags taken from their respective research communities. In this context it is not surprising that KNN_rt performs so well. It creates a neighborhood of resources drawn from the user profile and recommends tags which the user applied to these similar resources. Because the users are often interested in a narrow domain, it is relatively easy to find similar resources in the user profile. Because the user is motivated to organize resources for later retrieval (perhaps when citing research in his own publications), the tags applied to the neighbors are a good indicator of which tags should be applied to the new resource. The utility of KNN_rt is also demonstrated by its dominance in the hybrid: its α is more than 50%, as shown in Table 1. As usual, KNN_ut plays an important role in the hybrid, promoting tags that have been applied to the resource by other users.
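The behavior attributed to KNN_rt above (find resources in the user's own profile that resemble the query resource, then reuse the user's tags from those neighbors) can be sketched as follows. This is an illustrative sketch only: cosine similarity over raw tag counts, the neighborhood size k, and the function names are assumptions rather than details confirmed by this section.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_rt(annotations, user, query_resource, k=2, n=3):
    """Item-based sketch: resources are modeled as vectors over the tag space."""
    resource_tags = defaultdict(Counter)  # resource -> tag counts over all users
    profile = defaultdict(set)            # resource -> tags applied by `user`
    for u, r, t in annotations:
        resource_tags[r][t] += 1
        if u == user:
            profile[r].add(t)

    query_vec = resource_tags[query_resource]
    # Neighborhood: the k resources in the user's profile most similar to the query.
    sims = [(cosine(query_vec, resource_tags[r]), r)
            for r in profile if r != query_resource]
    neighbors = sorted(sims, key=lambda pair: (-pair[0], pair[1]))[:k]

    # Each neighbor votes for the tags the user applied to it, weighted by similarity.
    scores = Counter()
    for sim, r in neighbors:
        for t in profile[r]:
            scores[t] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, s in ranked[:n] if s > 0]


# Hypothetical annotations as (user, resource, tag) triples.
toy = [
    ("alice", "paper1", "ml"), ("alice", "paper1", "svm"),
    ("alice", "paper2", "biology"),
    ("bob", "paper1", "ml"), ("bob", "paper3", "ml"),
    ("carol", "paper3", "svm"), ("carol", "paper2", "biology"),
]
suggested = knn_rt(toy, "alice", "paper3")
```

Here "paper3" shares tags with "paper1" but not with "paper2", so only the tags alice applied to "paper1" are suggested.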
The hybrid outperforms PITF for smaller recommendation sets, but as the size of the recommendation set increases the results become nearly identical.

4.3.3 MovieLens

MovieLens exhibits similar patterns to Citeulike. The ordering of the components is nearly identical. The hybrid and PITF produce a modest improvement over KNN_rt, and the hybrid is composed primarily of KNN_rt and secondarily of KNN_ut. This may be due to the similarity in how users interact with the two systems. MovieLens users will gravitate toward particular types of movies; in Citeulike users will focus on their area of research. In MovieLens users might be influenced by the labels often attributed to movies ("action", "horror", "romance"); likewise, Citeulike users often employ labels taken from their area of expertise. The similarity in how users interact with the systems results in datasets with similar underlying characteristics. In this dataset, PITF outperforms the hybrid by a small but statistically significant amount. PITF appears able to identify important latent features unattainable by the component recommenders and consequently by the hybrid.

4.3.4 Delicious

Delicious is our largest and most diverse dataset. It contains 720,788 annotations, in which users tag web pages. The worst recommender is pop_u. In no other dataset does it perform so poorly. This indicates that users in Delicious are not as likely to reuse tags as users in other systems, perhaps because the resource space is much broader, encompassing the entire Internet. On the other hand, pop_r does remarkably well for such a simple recommender, revealing that the users are arriving at a consensus on how to label resources.
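The contrast between the two popularity recommenders can be illustrated with a short sketch. It assumes that pop_u ranks tags by how often the user has applied them and that pop_r ranks tags by how often they have been applied to the resource; the function names and the sample data are hypothetical.

```python
from collections import Counter


def top_n(counts, n):
    """Return the n most frequent keys, breaking ties alphabetically."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:n]]


def pop_u(annotations, user, n=3):
    """Most frequent tags in the user's own profile (tag reuse)."""
    return top_n(Counter(t for u, _, t in annotations if u == user), n)


def pop_r(annotations, resource, n=3):
    """Most frequent tags applied to the resource by anyone (consensus)."""
    return top_n(Counter(t for _, r, t in annotations if r == resource), n)


# Hypothetical (user, resource, tag) triples.
sample = [
    ("u1", "r1", "news"), ("u2", "r1", "news"), ("u3", "r1", "politics"),
    ("u1", "r2", "todo"), ("u1", "r3", "todo"),
]
by_resource = pop_r(sample, "r1")
by_user = pop_u(sample, "u1")
```

When users seldom reuse their own tags, as observed for Delicious, the counts behind pop_u are flat and uninformative, while agreement among users on a resource still gives pop_r a usable signal.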

The two user-based collaborative filtering methods perform similarly. Drawing upon a neighbor's opinion about a web site appears to do well whether that neighbor was discovered by modeling users over the resource space or over the tag space. In contrast, there is a marked difference between the two item-based methods: KNN_rt does far better than KNN_ru, suggesting once again that, within the confines of tag recommendation, resources are better modeled by tags than by users. PITF outperforms all the individual recommenders. The hybrid offers a clear improvement over the other methods, including PITF. As with most of the datasets, it relies strongly on KNN_ut and KNN_rt.

4.3.5 Amazon

Amazon presents one of the easier targets for tag recommendation. The two integrative models achieve better than 95% recall for a recommendation set with ten tags. The hybrid clearly outperforms PITF. In this dataset KNN_ur and KNN_ut are relatively close and run parallel to one another, as do KNN_ru and KNN_rt. This congruence suggests that multiple dimensions of the dataset contain valuable information. Given the task of tag recommendation, however, it appears that it is marginally better to model users and resources over the tag space. Amazon users tag products for later retrieval. Very often they use tags drawn from the product space, such as "action" or "dvd". This behavior is similar to that observed in Citeulike. In contrast to Citeulike users, however, Amazon users rarely limit themselves to a narrow range of items. They may freely label books, electronics or clothing. As a result, user-based collaborative filtering is more competitive. It selects tags already applied to the resource rather than relying on tags applied by the user to similar items. Unlike in Citeulike, it is not as likely that the user profile will contain these similar items.

4.3.6 LastFM

LastFM is another easy target for tag recommenders, offering more than 90% recall.
The results of the two integrative approaches are so similar that the recall-precision lines obscure one another. LastFM users appear to reuse tags to a high degree, as made evident by the success of pop_u. In contrast, the poor results of pop_r show that users do not often agree on how to label a resource. In LastFM, item-based collaborative filtering does very well, drawing upon the user's tendency to tag similar items in a similar manner. User-based filtering, which relies on the opinions of others, does poorly. The composition of the hybrid reveals a sharp departure from the other datasets. It favors KNN_ru over KNN_rt even though KNN_rt does marginally better as an individual recommender. The importance of modeling resources as users in the hybrid may be due to the interaction of users within the social annotation system. An important focus of the application is the sharing and discovery of resources through the user space.

4.4 Summary

These results underscore the importance of an integrative approach to tag recommendation in social annotation systems. Social annotation systems vary in how users interact with the system. The differences between datasets make the performance of individual recommenders unpredictable. For example, KNN_ru does well in LastFM but performs poorly in Delicious. In contrast, the integrative techniques perform well regardless of the characteristics of the data. The proposed linear weighted hybrid offers additional benefits. It is easily extensible. In this work we constructed the hybrid from popularity-based and collaborative filtering techniques, but the hybrid could be augmented with recommendation techniques that draw on different approaches, such as recency, content or context-based recommenders. Because it is constructed from individual components, the hybrid is easily updatable and suitable for large-scale deployment.
The use of individual components also permits the examination of the underlying characteristics of the data through an analysis of the contributions of the components and the dimensions of the data which they exploit. Furthermore, individual recommendations can be explained, a capability not shared by black-box recommenders such as PITF. In many cases the hybrid outperforms the pairwise interaction tensor factorization model. In Delicious and Amazon, where the user models are most diverse, the benefit is most noticeable. This marks our proposed linear weighted hybrid as a viable state-of-the-art tag recommender.

5. USING INFORMATION CHANNELS TO EXPLAIN THE PERFORMANCE OF TAG RECOMMENDERS

Our results have demonstrated a difference in how individual component recommenders perform. In this section we turn our attention to why these differences may occur. To that end we introduce the notion of information channels. An information channel models the relationship between the underlying dimensions of an annotation system: users, resources and tags. A strong information channel between two dimensions means that information in the first dimension will be useful in building a predictor for the second dimension. For example, a strong information channel between users and tags means that user characteristics will be a good basis on which to predict tags. We first define information channels in terms of conditional entropy. We then explore the impact of information channels on the previously presented component recommenders. Finally, we offer a summary of our findings.

5.1 Quantifying Information Channels

We propose entropy and conditional entropy for the evaluation of information channels. Entropy measures the amount of uncertainty associated with a dimension, in this case the user, resource or tag dimension. It relies heavily on probabilities; however, the notion of probability in social annotation systems can be ambiguous.
The probability of a resource might be its likelihood to occur in a user profile, a tag profile or an annotation. We define the probability of a resource r as:

$$p(r) = \frac{\sum_{u \in U} \sum_{t \in T} URT(u, r, t)}{y} \qquad (11)$$

where y is defined as the number of non-zero entries in URT. We may then define the entropy as:

$$H(R) = -\sum_{r \in R} p(r) \log_y p(r) \qquad (12)$$
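Equations 11 and 12 can be sketched in a few lines of code. The sketch assumes that URT is binary, so that its non-zero entries are exactly the distinct (user, resource, tag) triples; the function name and the sample triples are hypothetical.

```python
import math
from collections import Counter


def entropy_base_y(annotations, dim):
    """Entropy of one dimension (0 = user, 1 = resource, 2 = tag), log base y.

    y is the number of distinct (user, resource, tag) triples, i.e. the
    non-zero entries of a binary URT (an assumption of this sketch).
    """
    triples = set(annotations)
    y = len(triples)
    if y < 2:
        return 0.0  # log base 1 is undefined; a single annotation has no uncertainty
    counts = Counter(triple[dim] for triple in triples)
    # p(x) = count(x) / y, as in Equation 11; sum as in Equation 12.
    return -sum((c / y) * math.log(c / y, y) for c in counts.values())


# Hypothetical data: four annotations spread evenly over two resources.
triples = [
    ("u1", "r1", "t1"), ("u2", "r1", "t1"),
    ("u1", "r2", "t2"), ("u2", "r2", "t2"),
]
h_resources = entropy_base_y(triples, dim=1)

# Four annotations on four different resources give the maximum value of 1.
spread = [("u1", "r%d" % i, "t1") for i in range(4)]
h_spread = entropy_base_y(spread, dim=1)
```

With two equally likely resources among four annotations the entropy is 0.5, and it reaches 1 only when every annotation falls on a different resource.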

Entropy calculations often use a logarithm of base 2, 10 or e. In this work we use a base of y. Doing so bounds the maximum entropy to 1. This will not change the relative values within a dataset, but it will permit the comparison of values across datasets.

Conditional entropy measures the uncertainty of a dimension given another dimension. The conditional entropy of the resource space given the tag space is defined as:

$$H(R|T) = -\sum_{r \in R} \sum_{t \in T} p(r, t) \log_y \frac{p(r, t)}{p(t)} \qquad (13)$$

where p(r, t) is the likelihood of r and t occurring together in URT, or more formally:

$$p(r, t) = \frac{\sum_{u \in U} URT(u, r, t)}{y} \qquad (14)$$

The conditional entropy of resources given users, H(R|U), can be calculated similarly. Once H(R), H(R|T) and H(R|U) have been calculated, it is possible to evaluate the information channels. If H(R|T) is roughly equal to H(R), tags are not offering additional information about the resource space; it might then be difficult to predict a resource given a tag. On the other hand, if H(R|T) is less than H(R), tags may be a good predictor of resources. Comparing the H(R|T) and H(R|U) values may suggest which information channel is most useful. Analogous definitions can be constructed for the entropy and conditional entropy of the user and tag spaces. It is important to note that H(R|T) is not equal to H(T|R). It may be the case that tags are good predictors of resources, but resources are not good predictors of tags.

             H(U)    H(U|R)  H(U|T)  H(R)    H(R|U)  H(R|T)  H(T)    H(T|U)  H(T|R)
Bibsonomy    0.462   0.187   0.273   0.666   0.391   0.348   0.564   0.375   0.246
Citeulike    0.597   0.179   0.260   0.726   0.308   0.337   0.605   0.268   0.217
MovieLens    0.491   0.249   0.163   0.683   0.441   0.293   0.648   0.320   0.258
Delicious    0.551   0.257   0.431   0.631   0.338   0.418   0.434   0.315   0.221
Amazon       0.609   0.275   0.348   0.646   0.312   0.297   0.505   0.244   0.156
LastFM       0.608   0.314   0.357   0.623   0.328   0.431   0.436   0.185   0.245

Table 2: The entropy and conditional entropy of users, resources and tags across all six datasets.
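The conditional entropy of Equations 13 and 14 can be sketched in the same style. As before, the sketch assumes a binary URT whose non-zero entries are the distinct triples, and the function name and toy data are hypothetical. Note that p(r, t) / p(t) reduces to a ratio of counts because both probabilities share the denominator y.

```python
import math
from collections import Counter


def conditional_entropy_base_y(annotations, target, given):
    """H(target | given) with log base y; dimensions are 0 = user, 1 = resource, 2 = tag."""
    triples = set(annotations)  # non-zero entries of a binary URT (assumption)
    y = len(triples)
    if y < 2:
        return 0.0
    joint = Counter((tr[target], tr[given]) for tr in triples)
    marginal = Counter(tr[given] for tr in triples)
    # p(a, b) = joint / y and p(b) = marginal / y, so p(a, b) / p(b) = joint / marginal.
    return -sum((c / y) * math.log(c / marginal[g], y)
                for (_, g), c in joint.items())


# Hypothetical data in which each tag is used on exactly one resource,
# while every user annotates every resource.
toy = [
    ("u1", "r1", "t1"), ("u2", "r1", "t1"),
    ("u1", "r2", "t2"), ("u2", "r2", "t2"),
]
h_r_given_t = conditional_entropy_base_y(toy, target=1, given=2)
h_r_given_u = conditional_entropy_base_y(toy, target=1, given=0)
```

In this toy dataset H(R) is 0.5. Knowing the tag removes all uncertainty about the resource, so H(R|T) is 0, whereas knowing the user removes none, so H(R|U) stays at 0.5. In the terms used above, the tag-to-resource channel is strong and the user-to-resource channel is weak.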
5.2 Information Channels and Component Recommenders

The metrics are reported in Table 2. The entropy of the tag space appears to coincide with the difficulty of recommending tags. The largest H(T) is found in MovieLens, where the top recommenders achieve a precision of just over 50%. The next largest values occur in Citeulike and Bibsonomy, where precision reaches 60% and 70% respectively. Amazon and LastFM produce the lowest values and allow precisions of more than 80%. In general, the higher the entropy of the tag space, the more difficult it is to recommend tags. The exception to this trend is Delicious, which appears to have low entropy but presents a more difficult target. This is explainable by the higher H(T|U) and H(T|R); the user and resource dimensions do not offer the same utility as they do in Amazon and LastFM.

The two user-based collaborative filtering methods build a neighborhood of similar users. This neighborhood is restricted to users that have annotated the query resource. In this respect both KNN_ur and KNN_ut draw from the user-resource channel. Both algorithms recommend tags applied to the input resource, emphasizing the resource-tag channel. The algorithms differ in the way they model users. KNN_ur models users over the resource space, reusing the user-resource channel. KNN_ut, on the other hand, models users over the tag space, adding a new dimension to the algorithm. This fundamental analysis based on information channels suggests that KNN_ut should outperform KNN_ur. In all six cases presented in Figure 1 it does.

Quantifying the strength of the information channels permits further insights. In Delicious, KNN_ur comes closest to KNN_ut. H(U|T) is 0.431 compared to the H(U) of 0.551; it appears that tags are not adding a great deal of new information. The resource-user channel, on the other hand, is stronger; the H(U|R) of 0.257 suggests that resources are much better than tags at modeling users.
In this case, it is advantageous to reuse the user-resource channel. Bibsonomy, Citeulike and Amazon show similar trends. The most extreme difference between KNN_ur and KNN_ut occurs in LastFM. In this case H(U|R) and H(U|T) show that resources are not any better than tags at modeling users, and the additional dimension covered by KNN_ut allows the better results. MovieLens displays similar characteristics.

In the item-based collaborative filtering methods, the recommended tags are drawn directly from the user profile, stressing the user-tag channel. The user-resource channel is exploited by focusing on resources from the user's profile. The two methods differ in how the resource is represented. KNN_ru models the resource as a vector over the user space and KNN_rt models the resource as a vector over the tag space. Since KNN_rt adds an additional information channel to the approach, we expect it to outperform KNN_ru. In all six cases we observe this to be true.

This theoretical analysis based on information channels is once again augmented by an examination of the metrics. In LastFM KNN_ru performs nearly as well as KNN_rt. This is because H(T|U) is so low in LastFM; users reuse tags with such consistency that it matters little how the resources are modeled. Likewise, the congruence of the two models in Amazon is owed to the low overall H(T) and to the ability to represent resources as tags demonstrated by H(T|R). In the remaining datasets, where KNN_ru is clearly outperformed by KNN_rt, H(T) is larger and it appears to be more difficult to model users with resources.

These results point toward a framework for understanding the structure of social annotation data. These systems vary in the way users interact with the application, producing underlying characteristics which draw upon different dimensions of the data. Our information channel metrics

based on entropy and conditional entropy attempt to reveal these characteristics and explain the performance of tag recommenders across several datasets.

6. CONCLUSIONS

This paper has explored the problem of tag recommendation in social annotation systems and proposed a weighted linear hybrid incorporating simple popularity and collaborative filtering components. The success of the hybrid over the lower-dimensional components demonstrates clearly the importance of an integrative approach that exploits multiple dimensions of the data. Evaluations also show that the hybrid matches or outperforms a state-of-the-art model-based algorithm based on tensor factorization (PITF), particularly when the user profiles are diverse. The weighted hybrid has the additional advantages of being more efficient, scalable, extensible and explainable than PITF.

Experiments across six real-world datasets reveal interesting differences between social annotation applications, a result of the widely varying user populations, resource types and application characteristics found in these applications. These differences are revealed most clearly in the performance of the individual components of the hybrid, which varies widely from dataset to dataset. By measuring characteristics of the data via the metrics of entropy and conditional entropy, we show that it is possible to explain in qualitative terms the reasons for these differences in recommender performance.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation Grant IIS-0916852 and a grant from the Department of Education, Graduate Assistance in the Area of National Need, P200A070536.

7. REFERENCES

[1] R. Burke. Hybrid Recommender Systems: Survey and Experiments. User Modeling and User-Adapted Interaction, 12(4):331-370, 2002.
[2] M. Deshpande and G. Karypis. Item-Based Top-N Recommendation Algorithms. ACM Transactions on Information Systems, 22(1):143-177, 2004.
[3] J. Gemmell, M. Ramezani, T. Schimoler, L. Christiansen, and B. Mobasher. A Fast Effective Multi-Channeled Tag Recommender. ECML/PKDD 2009 Discovery Challenge Workshop, part of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 59-63, 2009.
[4] J. Gemmell, T. Schimoler, M. Ramezani, L. Christiansen, and B. Mobasher. Improving FolkRank With Item-Based Collaborative Filtering. Recommender Systems & the Social Web, 2009.
[5] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An Algorithmic Framework for Performing Collaborative Filtering. In 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 237. ACM, 1999.
[6] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. BibSonomy: A social bookmark and publication sharing system. In Proceedings of the Conceptual Structures Tool Interoperability Workshop at the 14th International Conference on Conceptual Structures, pages 87-102. Citeseer, 2006.
[7] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. Information Retrieval in Folksonomies: Search and ranking. Lecture Notes in Computer Science, 4011:411-426, 2006.
[8] R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag Recommendations in Folksonomies. Lecture Notes in Computer Science, 4702:506, 2007.
[9] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, and J. Riedl. GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, 40(3):87, 1997.
[10] L. Marinho, C. Preisach, L. Schmidt-Thieme, I. Cantador, D. Vallet, J. Jose, H. Cao, M. Xie, L. Xue, C. Liu, et al. ECML PKDD Discovery Challenge 2009-DC09.
[11] A. Mathes. Folksonomies-Cooperative Classification and Communication Through Shared Metadata. Computer Mediated Communication, (Doctoral Seminar), Graduate School of Library and Information Science, University of Illinois Urbana-Champaign, December, 2004.
[12] P. Mika. Ontologies are us: A unified model of social networks and semantics. Web Semantics: Science, Services and Agents on the World Wide Web, 5(1):5-15, 2007.
[13] S. Rendle and L. Schmidt-Thieme. Factor Models for Tag Recommendation in BibSonomy. ECML/PKDD 2008 Discovery Challenge Workshop, part of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 235-243, 2009.
[14] S. Rendle and L. Schmidt-Thieme. Pairwise Interaction Tensor Factorization for Personalized Tag Recommendation. In Proceedings of the third ACM international conference on Web search and data mining, pages 81-90. ACM, 2010.
[15] G. Salton, A. Wong, and C. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11):613-620, 1975.
[16] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-Based Collaborative Filtering Recommendation Algorithms. In 10th International Conference on World Wide Web, page 295. ACM, 2001.
[17] U. Shardanand and P. Maes. Social Information Filtering: Algorithms for Automating "Word of Mouth". In SIGCHI Conference on Human Factors in Computing Systems, pages 210-217. New York, NY, USA, 1995.
[18] P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos. Tag recommendations based on tensor dimensionality reduction. In Proceedings of the 2008 ACM conference on Recommender systems, pages 43-50. ACM, 2008.
[19] P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos. A Unified Framework for Providing Recommendations in Social Tagging Systems Based on Ternary Semantic Analysis. IEEE Transactions on Knowledge and Data Engineering, 2009.