Workshop Proceedings PL-12


Workshop Proceedings

PREFERENCE LEARNING: PROBLEMS AND APPLICATIONS IN AI (PL-12)

August 28, 2012, Montpellier, France


Preference Learning: Problems and Applications in AI (PL-12)

The topic of preferences has recently attracted considerable attention in Artificial Intelligence (AI) research, notably in fields such as agents, non-monotonic reasoning, constraint satisfaction, planning, and qualitative decision theory. Preferences provide a means for specifying desires in a declarative way, which is a point of critical importance for AI. Drawing on past research on knowledge representation and reasoning, AI offers qualitative and symbolic methods for treating preferences that can reasonably complement hitherto existing approaches from other fields, such as decision theory and economic utility theory. Needless to say, however, the acquisition of preferences is not always an easy task. Therefore, not only are modeling languages and representation formalisms needed, but also methods for the automatic learning, discovery, and adaptation of preferences. It is hence hardly surprising that methods for learning and predicting preference models from explicit or implicit preference information and feedback are among the very recent research trends in machine learning and related areas.

The goal of this workshop is, on the one hand, to continue a series of successful workshops at the ECML/PKDD conference series (PL-08, PL-09, PL-10) but, more importantly, also to expand the scope by drawing the attention of a broader AI audience. In particular, we seek to identify new problems and applications of preference learning in areas such as natural language processing, game playing, decision making, and planning. Indeed, we believe that there is a strong potential for preference learning techniques in these areas, which has not yet been fully explored. The papers presented in this workshop provide a spotlight on ongoing research in this area. They make technical contributions to the field, but also show work in progress. We hope that these proceedings help to spark further interest in this emerging research area and contribute to establishing preference learning as an important topic in AI research.

Johannes Fürnkranz
Eyke Hüllermeier


Contents

A Preliminary Study on a Recommender System for the Million Songs Dataset Challenge (F. Aiolli)
Using and Learning GAI-Decompositions for Representing Ordinal Rankings (D. Bigot, H. Fargier, J. Mengin, B. Zanuttini)
Learning Conditional Lexicographic Preference Trees (M. Bräuning, E. Hüllermeier)
An Apple-to-Apple Comparison of Learning-to-Rank Algorithms in Terms of Normalized Discounted Cumulative Gain (R. Busa-Fekete, G. Szarvas, T. Éltető, B. Kégl)
Alleviating Cold-User Start Problem with Users' Social Network Data in Recommendation Systems (E. Castillejo, A. Almeida, D. López-de-Ipiña)
Preprocessing Algorithm for Handling Non-Monotone Attributes in the UTA Method (A. Eckhardt, T. Kliegr)
Learning from Pairwise Preference Data Using Gaussian Mixture Model (M. Grbovic, N. Djuric, S. Vucetic)
Preference Function Learning over Numeric and Multi-valued Categorical Attributes (L. Marin, A. Moreno, D. Isern)
Direct Value Learning: a Preference-based Approach to Reinforcement Learning (D. Meunier, Y. Deguchi, R. Akrour, E. Suzuki, M. Schoenauer, M. Sebag)
Large Scale Co-Regularized Ranking (E. Tsivtsivadze, K. Hofmann, T. Heskes)
First Steps Towards Learning from Game Annotations (C. Wirth, J. Fürnkranz)


A Preliminary Study on a Recommender System for the Million Songs Dataset Challenge

Fabio Aiolli 1

Abstract. In this paper we describe the preliminary study we are conducting on the Million Songs Dataset (MSD) challenge. The task of the competition is to suggest a set of songs to a user, given half of the user's listening history and the complete listening histories of 1 million other people. We focus on memory-based collaborative filtering approaches, since they are able to deal with large datasets in an efficient and effective way. In particular, we investigated i) the definition of suitable similarity functions, ii) the effect of the locality of the collaborative scoring function, and iii) the aggregation of multiple ranking strategies to define the overall recommendation. Using these techniques we are in the top positions of the current competition leaderboard (at the time of this writing, the challenge has about 150 registered teams).

1 Introduction

The Million Song Dataset Challenge [5] is a large-scale music recommendation challenge, where the task is to predict which songs a user will listen to, given the listening history of the user. The challenge is based on the Million Song Dataset (MSD), a freely available collection of metadata for one million contemporary songs (e.g., song titles, artists, year of publication, audio features, and much more) [2]. About one hundred and fifty teams are currently participating in the challenge. The subset of the data actually used in the challenge is the so-called Taste Profile Subset, which consists of more than 48 million triplets (user, song, count) gathered from user listening histories. The data covers about 1.2 million users and more than 380,000 songs in MSD. The user-item matrix is very sparse, as the fraction of non-zero entries (the density) is only 0.01%.

Collaborative Filtering (CF) is a technology which uses the item-by-user matrix to discover users with tastes similar to those of the active user, i.e., the user for whom we want to make the prediction. The intuition is that if other users, similar to the active user, already purchased a certain item, then it is likely that the active user will like that item as well. A similar (dual) consideration can be made by changing the point of view: if we know that a set of items is often purchased together (the items are similar in some sense), and the active user has bought one of them, then he/she will probably be interested in the others as well. The first view is the one that is prevalent in the recent CF literature. In this paper, we show that the second view turned out to be more useful for the MSD competition.

In Section 2 collaborative filtering is described and proposed as a first approach to solve the MSD problem. In the same section, three different views of the memory-based CF task are proposed, which motivate the different algorithms of that section. In Section 3 empirical results of the proposed techniques are presented and discussed.

1 University of Padova, Italy, aiolli@math.unipd.it

2 A CF approach to the MSD task

Collaborative filtering techniques use a database in the form of a user-item matrix R of preferences. In a typical CF scenario there is a set U of n users and a set I of m items, and the entries of $R = \{r_{ui}\} \in \mathbb{R}^{n \times m}$ represent how much user u likes item i. In this paper we assume $r_{ui} \in \{0, 1\}$, as this is the setting of the MSD challenge. An entry $r_{ui} = 1$ represents the fact that user u has listened to (or would like to listen to) song i.
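As a side illustration (ours, not part of the original paper), such a binary matrix R can be assembled from the challenge's (user, song, count) triplets; the file name and the tab-separated layout below are assumptions.

```python
# Hedged sketch: building the binary user-item matrix R from
# (user, song, count) triplets; counts are binarized since r_ui is in {0, 1}.
import scipy.sparse as sp

def load_matrix(path, user_ids, song_ids):
    """user_ids/song_ids map raw identifiers to row/column indices."""
    rows, cols = [], []
    with open(path) as f:
        for line in f:
            user, song, _count = line.rstrip("\n").split("\t")
            rows.append(user_ids[user])
            cols.append(song_ids[song])
    R = sp.csr_matrix(([1] * len(rows), (rows, cols)),
                      shape=(len(user_ids), len(song_ids)), dtype="int8")
    R.data[:] = 1  # collapse any duplicate triplets to a 0/1 matrix
    return R
```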
In the following we refer to items and songs interchangeably. The MSD challenge task is more properly described as a top-τ recommendation task. Specifically, we want to identify a list of τ (τ = 500 in the challenge) items $I_u \subseteq I$ that the active user u will like the most. Clearly, this set must be disjoint from the set of items already rated (purchased, or listened to) by the active user.

2.1 Memory-based CF

In memory-based CF algorithms the entire user-item matrix is used to generate a prediction. Generally, given a new user for whom we want to obtain a prediction, the set of items to suggest is computed by looking at similar users. This strategy is typically referred to as user-based recommendation. Alternatively, in the item-based recommendation strategy, one computes the items most similar to the items already purchased by the active user, and then aggregates those items to form the final recommendation. There are many different proposals on how to aggregate the information provided by similar users/items (see [7] for a good survey). We focus on the simple weighted-sum strategy. A deeper analysis of this simple strategy allows us to highlight an interesting duality between user-based and item-based recommendation algorithms. In the user-based type of recommendation, the scoring function on the basis of which the recommendation is made is computed by

$$h^U_{ui} = \sum_{v \in U} f(w_{uv}) \, r_{vi} = \sum_{v \in U(i)} f(w_{uv}),$$

that is, the score obtained on an item for a target user is proportional to the similarities between the target user u and the other users v that have purchased item i (v ∈ U(i)). This score will be higher for items which are often rated by similar users. On the other hand, within the item-based type of recommendation [3, 6], the target item i is associated with the score

$$h^S_{ui} = \sum_{j \in I} f(w_{ij}) \, r_{uj} = \sum_{j \in I(u)} f(w_{ij}),$$

and hence the score is proportional to the similarities between item i and the other items already purchased by user u (j ∈ I(u)).
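To make the duality concrete, here is a small sketch (our code, not the author's) of both weighted-sum scoring functions on a dense 0/1 matrix; at the scale of the challenge, sparse representations would be needed.

```python
import numpy as np

def user_based_scores(R, u, W_users, f=lambda w: w):
    """h^U_{ui} = sum_{v in U(i)} f(w_uv): aggregate the similarities of
    user u to every user v who listened to item i."""
    return f(W_users[u]) @ R            # one score per item

def item_based_scores(R, u, W_items, f=lambda w: w):
    """h^S_{ui} = sum_{j in I(u)} f(w_ij): aggregate the similarities of
    item i to the items j already in u's listening history."""
    return f(W_items) @ R[u]            # f applied entrywise to W_items

# Tiny usage example: 3 users, 4 items, crude co-count user similarity.
R = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1]])
W = R @ R.T
print(user_based_scores(R, 0, W, f=lambda w: w ** 2))
```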

The function f(w) can be assumed to be monotonically non-decreasing; its role is to emphasize/de-emphasize similarity contributions so as to adjust the locality of the scoring function, that is, how many of the nearest users/items really matter in the computation. As we will see, a correct setting of this function turned out to be very useful with the challenge data. Interestingly, in both cases we can decompose the user and item contributions in a linear way, that is, we can write $h^U_{ui} = \mathbf{w}_u \cdot \mathbf{r}^{(i)}$, with $\mathbf{w}_u \in \mathbb{R}^n$, and $h^S_{ui} = \mathbf{w}_i \cdot \mathbf{r}^{(u)}$, with $\mathbf{w}_i \in \mathbb{R}^m$. In other words, we are defining an embedding for users (in user-based recommendation) and for items (in item-based recommendation). In the specific case above, this corresponds to choosing the particular vector $\mathbf{r}^{(i)}$ as the vector with n entries in {0, 1} such that $r^{(i)}_u = r_{ui}$. Similarly, for the representation of users in item-based scoring, we choose $\mathbf{r}^{(u)}$ as the vector with m entries in {0, 1} such that $r^{(u)}_i = r_{ui}$.

The main contribution of this paper is to explore how we can set the vectors $\mathbf{w}_i$ and $\mathbf{w}_u$ in a principled way, by using the entire user-item preference matrix on the fly when a new recommendation has to be made. Alternatively, we could also try to learn the weight vectors from data, by noticing that a recommendation task can be seen as a multilabel classification problem where songs represent the labels and users represent the examples. We have performed preliminary experiments in this direction using the preference learning approach described in [1]. The results are promising, but the problem in this case is the computational requirements of such a model-based paradigm. For this reason we decided to postpone further analysis of this setting to future work.

Finally, an alternative and useful way to look at the duality between user-based and item-based recommendation is the following, which highlights a clear connection between this task and link prediction. Specifically, we can think of the user-item matrix R = {r_ui} as a bipartite graph where the set of nodes is N = U ∪ I, and there is an edge from u ∈ U to i ∈ I whenever r_ui = 1. Now, the user-based recommendation strategy corresponds to the intuition that if a user tends to link to the same set of items as the active user, then this gives us some evidence that a link can exist between the active user and item i. Dually, in item-based recommendation, the intuition is that if the active user links to one of two items which tend to be linked by the same users, then we can infer that the active user will probably link to the other item as well.

2.2 User-based and Song-based similarity

A large part of the CF literature of the last decade deals with the problem of defining a good similarity measure. A common opinion is that no single similarity measure can fit all possible domains in which collaborative filtering is used. In this section, we define a parametric family of user-based and item-based similarities that can fit different problems. In the challenge we have no relevance grades, since the ratings are binary values. This is a first simplification we can exploit in the definition of similarity functions. The similarity function that is commonly used in this case, both for the user-based and the item-based case, is the cosine similarity. In the case of binary grades the cosine similarity can be simplified as follows.
Let I(u) be the set of items rated by a generic user u; then the cosine similarity between two users u, v is defined by

$$w_{uv} = \frac{|I(u) \cap I(v)|}{|I(u)|^{1/2} \, |I(v)|^{1/2}}$$

and, similarly for items, letting U(i) be the set of users who have rated item i, we have

$$w_{ij} = \frac{|U(i) \cap U(j)|}{|U(i)|^{1/2} \, |U(j)|^{1/2}}.$$

The cosine similarity has the nice property of being symmetric but, as we show in the experimental section, it might not be the best choice. In fact, especially in the item case, we are more interested in computing how likely it is that an item will be appreciated by a user when we already know that the same user likes another item. Clearly, this definition is not symmetric. As an alternative to the cosine similarity, and, we think, a more well-founded way of computing the weights w_ij in this case, one can resort to the conditional probability measure, which can be estimated with the following formulas:

$$w_{uv} = P(u|v) = \frac{|I(u) \cap I(v)|}{|I(v)|} \qquad \text{and} \qquad w_{ij} = P(i|j) = \frac{|U(i) \cap U(j)|}{|U(j)|}.$$

Previous works (see [4] for example) pointed out that the conditional probability measure of similarity, P(i|j), has the limitation that items which are purchased frequently tend to have higher values, not because of their co-occurrence frequency but because of their popularity. In our opinion, this might not be a limitation in a recommendation setting; rather, it could be an undesired feature when we want to cluster items. In fact, this correlation measure should not be thought of as a real similarity measure. As we will see, experimental results seem to confirm this hypothesis, at least in the item-based case.

We are now able to propose a parametric generalization of the above similarity measures. This parametrization permits ad-hoc optimization of the similarity function for the domain of interest, for example by validating on available data. Specifically, we propose to use the following combination of conditional probabilities:

$$w_{uv} = P(v|u)^{\alpha} \, P(u|v)^{1-\alpha}, \qquad w_{ij} = P(j|i)^{\alpha} \, P(i|j)^{1-\alpha} \tag{1}$$

where α ∈ [0, 1] is a parameter to tune. As above, we estimate the probabilities by resorting to the frequencies in the data and derive the following:

$$w_{uv} = \frac{|I(u) \cap I(v)|}{|I(u)|^{\alpha} \, |I(v)|^{1-\alpha}}, \qquad w_{ij} = \frac{|U(i) \cap U(j)|}{|U(i)|^{\alpha} \, |U(j)|^{1-\alpha}}. \tag{2}$$

It is easy to see that the standard similarity based on the conditional probability P(u|v) (resp. P(i|j)) is obtained by setting α = 0, the inverted conditional P(v|u) (resp. P(j|i)) is obtained by setting α = 1, and the cosine similarity is obtained when α = 1/2. This analysis also suggests an interesting interpretation of the cosine similarity in terms of conditionals.
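A compact sketch of the parametric item similarity of Eq. (2) (our code; function and variable names are ours): with α = 1/2 it reduces to the binary cosine similarity, and with α = 0 to the conditional probability P(i|j).

```python
import numpy as np

def item_similarities(R, alpha=0.5):
    """R: binary users x items array. Returns W with
    w_ij = |U(i) & U(j)| / (|U(i)|^alpha * |U(j)|^(1-alpha))."""
    co = R.T @ R                            # co[i, j] = |U(i) & U(j)|
    n_users = R.sum(axis=0).astype(float)   # n_users[i] = |U(i)|
    denom = np.outer(n_users ** alpha, n_users ** (1.0 - alpha))
    with np.errstate(divide="ignore", invalid="ignore"):
        W = np.where(denom > 0, co / denom, 0.0)
    return W
```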

2.3 Locality of the Scoring Function

In Section 2 we saw that, in memory-based CF, the final recommendation is computed by a scoring function which aggregates the scores obtained from individual users or items. It is therefore important to determine how much each individual scoring component influences the overall score. This is the role of the function f(w). In the following experiments we use the exponential family of functions, that is, f(w) = w^q, where q ∈ N. The effect of this exponentiation is the following: when q is high, smaller weights drop to zero while higher ones are (relatively) emphasized. At the other extreme, when q = 0, the aggregation is performed by simply adding up the ratings. Note that, in the user-based type of scoring function, this corresponds to taking the popularity of an item as its score, while in the item-based type of scoring function it would result in a constant score for all items (the number of ratings made by the active user).

2.4 Ranking Aggregation

There are many sources of information available regarding songs. For example, it could be useful to consider the additional metadata which is also available and to construct alternative rankings based on it. It is always difficult to determine a single strategy which is able to correctly rank the songs. An alternative is to use multiple strategies, generate multiple rankings, and finally combine those rankings. Typically, these different strategies are individually precision-oriented, meaning that each strategy is able to correctly recommend a few of the correct songs with high confidence, but other songs which the user likes may not be suggested by that particular ranker. Hopefully, if the rankers are different, they will recommend different songs. If this is the case, a possible solution is to predict a final recommendation that contains all the songs for which the single strategies are most confident. A stochastic version of this aggregation strategy can be described as follows. We assume we are given, for each available strategy, the list of songs not yet rated by the active user, in order of confidence. At each step, the recommender randomly chooses one of the lists according to a probability distribution p_i over the predictors, and recommends the best-scored item of that list which has not yet been inserted in the current recommendation.

3 Experiments and Results

In the MSD challenge we have: i) the full listening history for about 1M users, and ii) half of the listening history for 110K users (10K in the validation set, 100K in the test set), for which we have to predict the missing half. Further, we also prepared a home-made validation split (HV) of the original training data, with about 900K training users (HVtr, with full listening history). The remaining 100K users' histories were split into two halves (HVvi, the visible one, and HVhi, the hidden one). The experiments presented in this section are based on this HV data and compare different similarities and different approaches. The baseline is the simple popularity-based method, which recommends the most popular songs not yet listened to by the user. Besides the baseline, we report experiments on both the user-based and song-based scoring functions, and an example of the application of ranking aggregation.

3.1 Taste Profile Subset Stats

For completeness, in this section we report some statistics about the original training data. In particular, the following table shows the minimum, maximum, average, and median number of users per song and songs per user:

[Table: min / max / average / median of users per song and songs per user; numeric values not recovered.]

We can see that the large majority of songs have only a few listeners (fewer than 13 users for half of the songs) and the large majority of users have listened to few songs (fewer than 27 for half of the users).
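These statistics are cheap to recompute; a minimal sketch (ours), reusing the sparse matrix R from the earlier sketch, is:

```python
import numpy as np

def describe(counts):
    """min / max / average / median of a vector of counts."""
    return dict(min=counts.min(), max=counts.max(),
                ave=counts.mean(), median=np.median(counts))

users_per_song = np.asarray(R.sum(axis=0)).ravel()  # |U(i)| for each song i
songs_per_user = np.asarray(R.sum(axis=1)).ravel()  # |I(u)| for each user u
print(describe(users_per_song), describe(songs_per_user))
```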
These characteristics of the dataset make the top-τ recommendation task quite challenging.

3.2 Truncated Mean Average Precision

In conformity with the challenge, we used the truncated mAP (mean average precision) as the evaluation metric. Let y denote a ranking over items, where y(p) = i means that item i is ranked at position p. The mAP metric emphasizes the top recommendations. For any k ≤ τ, the precision at k (π_k) is defined as the proportion of correct recommendations within the top k of the predicted ranking (assuming the ranking y does not contain the visible songs):

$$\pi_k(u, y) = \frac{1}{k} \sum_{p=1}^{k} r_{u\,y(p)}.$$

For each user, the (truncated) average precision is the average of the precisions at each recall point:

$$AP(u, y) = \frac{1}{\tau_u} \sum_{p=1}^{\tau} \pi_p(u, y) \, r_{u\,y(p)},$$

where τ_u is the smaller of τ and the number of user u's positively associated songs. The average of the AP(u, y_u) over all users gives the mean average precision (mAP).

3.3 Results

The result obtained on the HV data with the baseline (recommendation by popularity) is presented in Table 1. With this strategy, each song i simply gets a score proportional to the number of users |U(i)| who listened to it.

Table 1: map@500 obtained by the baseline method (song popularity) on HV data. [Value not recovered.]

In Table 2, we report experiments that show the effect of the locality parameter q for different strategies: item-based and user-based (both the conditional-probability and cosine versions). As we can see, apart from the IS case with cosine similarity (Table 2b), a correct setting of the parameter q dramatically improves the effectiveness on HV data. The best performance is reached with the conditional probability in an item-based strategy (Table 2a). In Figure 1, we present results obtained by fixing the parameter q and varying the parameter α, for both user-based and item-based recommendation strategies. We see that, in the item-based case, the results improve when setting a non-trivial α; the best result was obtained for α = 0.15. Finally, in Table 3, two of the best-performing rankers are combined, and their recommendations aggregated, using the stochastic algorithm described in Section 2.4. In particular, in order to maximize the diversity of the two rankers, we aggregated an item-based ranker with a user-based ranker. We can see that the combined performance improves further on validation data. Building alternative and effective rankers based on the available metadata is not a trivial task and was not the focus of our current study, so we decided to postpone this additional analysis to the near future.
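For reference, a direct transcription of the truncated mAP of Section 3.2 into code (our sketch, not the challenge's official evaluation script):

```python
def average_precision(ranking, hidden, tau=500):
    """Truncated AP: `ranking` lists predicted items (visible songs removed),
    `hidden` is the user's held-out half of the listening history."""
    hidden = set(hidden)
    tau_u = min(tau, len(hidden))
    if tau_u == 0:
        return 0.0
    hits, ap = 0, 0.0
    for p, item in enumerate(ranking[:tau], start=1):
        if item in hidden:
            hits += 1
            ap += hits / p            # precision at this recall point
    return ap / tau_u

def mean_average_precision(all_rankings, all_hidden, tau=500):
    aps = [average_precision(r, h, tau)
           for r, h in zip(all_rankings, all_hidden)]
    return sum(aps) / len(aps)
```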

Table 2: Results obtained by item-based (IS) and user-based (US) CF methods varying the locality parameter (exponent q) of the similarity function: (a) IS with α = 0, (b) IS with α = 1/2, (c) US with α = 0, (d) US with α = 1/2. [map@500 values for each q not recovered.]

Figure 1: Results obtained by item-based (IS) and user-based (US) CF methods varying the α parameter: (a) IS with 0 ≤ α ≤ 0.5 and q = 3, best map@500 at α = 0.15; (b) US with 0 ≤ α ≤ 1 and q = 5, best map@500 at α = 0.6.

Table 3: Results obtained by aggregating the rankings of two different strategies, item-based (IS, α = 0.15, q = 3) and user-based (US, α = 0.3, q = 5), with different combinations. [map@500 values not recovered.]

ACKNOWLEDGEMENTS

We would like to thank the referees for their comments, which helped improve this paper considerably.

REFERENCES

[1] Fabio Aiolli and Alessandro Sperduti, A preference optimization based unifying framework for supervised learning problems, in Preference Learning, eds. Johannes Fürnkranz and Eyke Hüllermeier, 19-42, Springer-Verlag, (2010).
[2] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere, The million song dataset, in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), (2011).
[3] Mukund Deshpande and George Karypis, Item-based top-N recommendation algorithms, ACM Trans. Inf. Syst., 22(1), (2004).
[4] George Karypis, Evaluation of item-based top-N recommendation algorithms, in CIKM, (2001).
[5] Brian McFee, Thierry Bertin-Mahieux, Daniel P.W. Ellis, and Gert R.G. Lanckriet, The million song dataset challenge, in Proceedings of the 21st International Conference Companion on World Wide Web (WWW '12 Companion), New York, NY, USA, (2012). ACM.
[6] Badrul M. Sarwar, George Karypis, Joseph A. Konstan, and John Riedl, Item-based collaborative filtering recommendation algorithms, in WWW, (2001).
[7] Xiaoyuan Su and Taghi M. Khoshgoftaar, A survey of collaborative filtering techniques, Advances in Artificial Intelligence, (January 2009).

Using and Learning GAI-Decompositions for Representing Ordinal Rankings

Damien Bigot 1, Hélène Fargier 1, Jérôme Mengin 1, Bruno Zanuttini 2 3

Abstract. We study the use of GAI-decomposable utility functions for representing ordinal rankings on combinatorial sets of objects. Considering only the relative order of objects leaves a lot of freedom for choosing a particular utility function, which allows one to get more compact representations. We focus on the problem of learning such representations, and give a polynomial PAC-learner for the case when a constant bound is known on the degree of the target representation. We also propose linear programming approaches for minimizing such representations.

1 INTRODUCTION

The development of interactive systems for supporting decision-making and of recommender systems has highlighted the need for models capable of using a user's preferences to guide her choices. For over fifteen years, the modelling and compact representation of preferences have been active topics in Artificial Intelligence [15, 16, 5, 6, 11]. Existing formalisms are rich and flexible enough to describe the behaviour of complex decision rules. However, to be interesting in practice, these formalisms must also permit fast elicitation of a user's preferences, involving only a reasonable amount of interaction. Configuration of combinatorial products in business-to-customer problems [14] and preference-based search [18] are good examples of decision problems in which the user's preferences are not known a priori. In such applications, a single interaction with the user must typically last at most 0.25 s, and the whole session must typically last at most 20 minutes, even if the object to be recommended to the user is searched for in a combinatorial set.

When the user's preferences are qualitative and have a simple structure (for instance, when they are separable), conditional preference networks (CP-nets) and their variants [5, 4, 6] are popular representation frameworks. In particular, CP-nets come with efficient algorithms for finding most preferred extensions of objects (the outcome optimisation problem), and with efficient elicitation procedures [13]. Unfortunately, CP-nets suffer from a lack of expressivity, since most complete pre-orders cannot be represented by simple (acyclic) CP-nets. Contrastingly, generalised additively independent decompositions (GAI-decompositions) of utility functions [9, 1, 11] can represent complete pre-orders in a compact way, by exploiting the independencies between sets of variables. In a word, these are representations of utility functions by sums of the form $\sum_{i=1}^{n} u_i(z_i)$, where the $u_i$ are sub-utility functions with (hopefully) small scopes $Z_i$.

1 IRIT, Univ. Toulouse, France; prenom.nom@irit.fr
2 GREYC, Univ. Caen, France; prenom.nom@unicaen.fr
3 Partially funded by the ANR (projet LARDONS, ANR-2010-BLAN-0215)

Typically, a GAI-decomposition is used to represent a utility function, which assigns a value to each possible object and hence implicitly defines a complete pre-order on them (the greater the value, the more preferred the object). Such values may in some cases represent an amount of money which the user is ready to spend for the object, or may be defined implicitly by preferences on lotteries. However, in many applications the actual values of the utility function are not important: it is the ranking of the objects induced by the utility function, and the properties of this ranking, that are interesting.
Furthermore, the representation of the utility function is important too: it should enable fast answers to a variety of queries, not only dominance queries like "Is object o preferred to object o′?", but also more complex queries like "What are the top-k objects that fulfil some given constraints?", which are useful for typical recommendation systems.

In this paper, we investigate the use of GAI-decompositions for representing such ordinal rankings. Precisely, we take a GAI-decomposition to represent the ranking defined by the associated utility function. Since in general many different utility functions represent the same ranking, this leaves a lot more freedom for finding compact decompositions than if the utility function were fixed. In this context, we focus on the problem of learning a compact GAI-decomposition from (ordinal) examples, that is, essentially from statements of the form "I prefer object o to object o′". While some works on this topic have focused on a fixed target utility function (rather than a ranking) and assumed the structure (the scopes $Z_i$) to be known in advance, we consider the issue of learning any utility function which induces the target ranking, and assume nothing about the target structure except for a constant, known bound on its degree. We aim at finding a simple structure, in order to ease optimization queries (among others). This issue has not been addressed in the learning-to-rank literature, where the aim is usually to find a ranking function that can be used to answer dominance queries, as in e.g. [12, 10, 8].

After a review of GAI-decompositions (Section 2), we show in Section 3 that the GAI-decompositions consistent with a set of examples can be defined as the feasible solutions of a linear program. We give an efficient PAC-learner for our problem in Section 4, and extend our approach to the problem of learning minimal decompositions in Section 5. Some perspectives are discussed in Section 6.

2 GAI-DECOMPOSITIONS

In our context, the preference relation (or preferences) of a user on a set of objects χ is a complete pre-order, that is, a complete and transitive binary relation. Given two objects o, o′ ∈ χ, we take o ⪰ o′ to mean that o is at least as interesting to the user as o′. We write ≻ for the asymmetric part of the relation, and ∼ for its symmetric part.

Hence ⪰ is a linear order with possible ties on χ, ≻ is the strict part of it, and ∼ contains the ties.

Generalized additive independence provides a representation for preferences on combinatorial domains. Hence we assume that the objects in χ are described over a set of n variables X = {X₁,..., Xₙ}. We write Dᵢ for the (finite) domain of Xᵢ; hence the set of all objects is χ = D₁ × ... × Dₙ. Though our results can be extended to arbitrary finite domains, for simplicity of exposition we consider Boolean domains, and we write Dᵢ = {xᵢ, x̄ᵢ}. Slightly abusing notation, we also write objects of χ as sequences of values instead of as vectors. For instance, with X = {X₁, X₂, X₃, X₄}, the object (x₁, x̄₂, x₃, x̄₄) ∈ χ will be denoted by x₁x̄₂x₃x̄₄. Finally, for any subset of variables Z ⊆ X and object o, o[Z] denotes the projection of o onto the variables in Z, and, given a set of objects O ⊆ χ, O[Z] = {o[Z] | o ∈ O} denotes the set of projections of elements of O onto Z.

It is easy to see that any complete and transitive preference relation ⪰ can be represented by a utility function u : χ → R satisfying o ⪰ o′ ⟺ u(o) ≥ u(o′) for all o, o′ ∈ χ. Clearly, since the set χ is combinatorial (it contains 2ⁿ objects), it is impractical to directly elicit or explicitly store the relation ⪰ or a representation u. However, in some cases the utility function u satisfies strong independence properties between attributes [7], so that it can be represented by a set of local utility functions {uᵢ : Dᵢ → R | i = 1,..., m}, each of arity 1, satisfying $u(o) = \sum_{i=1}^{m} u_i(o[\{X_i\}])$ for all objects o. Such representations are clearly very compact, easy to elicit, and allow for efficiently computing optimal objects. Unfortunately, preference relations seldom satisfy this property of additive independence.

Example 1. Consider the set of variables X = {X₁, X₂, X₃} with D₁ = {beef (b), fish (f)}, D₂ = {red wine (r), white wine (w)}, D₃ = {lemon (l), mustard (m)}: χ contains 8 possible combinations. Consider the following ordering over χ:

brm ≻ brl ≻ frm ≻ frl ∼ bwm ≻ bwl ≻ fwm ≻ fwl

It can be represented with the additive utility function u defined by the following tables:

u₁: b → 5, f → 2;  u₂: r → 5, w → 1;  u₃: l → 2, m → 3

Consider now the following ordering:

brm ≻ bwm ∼ fwl ≻ brl ≻ fwm ∼ frl ≻ bwl ≻ frm

To represent this ordering with an additive utility, we would need u₁(b) > u₁(f), since brm is preferred to frm, and u₁(b) < u₁(f), since fwl is preferred to bwl. However, this ordering can be represented using local utilities over several variables: define, for any object o,

u(o) = u_{X₁,X₂}(o[{X₁, X₂}]) + u_{X₁,X₃}(o[{X₁, X₃}])

where u_{X₁,X₂} and u_{X₁,X₃} are defined by:

u_{X₁,X₂}: (b, r) → 5, (f, r) → 1, (b, w) → 2, (f, w) → 4
u_{X₁,X₃}: (b, l) → 3, (b, m) → 7, (f, l) → 5, (f, m) → 2

Example 1 shows that some variables may depend on one another, and that in this case the utility function must be decomposed onto sets of variables rather than onto singletons.

Definition 1 (GAI-decomposition). Let X = {X₁,..., Xₙ} be a set of variables, χ = D₁ × ... × Dₙ a set of objects, and u : χ → R a utility function on χ. A GAI-decomposition of u is a finite set G = {u_{Z₁},..., u_{Z_m}} of utility functions on subsets Zᵢ of X (i.e., u_{Zᵢ} : χ[Zᵢ] → R) such that $u(o) = \sum_{i=1}^{m} u_{Z_i}(o[Z_i])$ holds for all o ∈ χ. The degree of G is defined to be deg(G) = max_{i=1,...,m} |Zᵢ|, where |Zᵢ| denotes the cardinality of Zᵢ.
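A GAI-decomposition is straightforward to evaluate. The following sketch (our code, not the authors') replays the second ordering of Example 1 by summing the two tables over the projections of an object:

```python
def gai_utility(obj, tables):
    """obj: dict variable -> value; tables: list of (scope, table) pairs,
    where each table maps a tuple of values (the projection o[Z]) to a real."""
    return sum(t[tuple(obj[v] for v in scope)] for scope, t in tables)

u_x1x2 = {("b", "r"): 5, ("f", "r"): 1, ("b", "w"): 2, ("f", "w"): 4}
u_x1x3 = {("b", "l"): 3, ("b", "m"): 7, ("f", "l"): 5, ("f", "m"): 2}
tables = [(("X1", "X2"), u_x1x2), (("X1", "X3"), u_x1x3)]

brm = {"X1": "b", "X2": "r", "X3": "m"}
frm = {"X1": "f", "X2": "r", "X3": "m"}
print(gai_utility(brm, tables), gai_utility(frm, tables))  # 12 and 3: brm > frm
```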
Definition 2. Let u be a utility function, and let G be a GAI-decomposition of u. Then u (resp. G) is said to represent a preference relation ⪰ iff o ⪰ o′ ⟺ u(o) ≥ u(o′) for all o, o′ ∈ χ. Considering u (resp. G) as given, we also say that it induces this relation, and denote it by ⪰_u (resp. ⪰_G).

The local utility functions u_{Z₁},..., u_{Z_m} are also called GAI-tables, because they are typically implemented in tabular form. Clearly, for any utility function u, {u} is a GAI-decomposition of u of degree |X|. Also, writing u_c for the constant function with value c, for any GAI-decomposition G of u and any set of variables Z ⊆ X, G ∪ {u₀(Z)} is also a GAI-decomposition of u. Similarly, G ∪ {u_c(Z)} is a GAI-decomposition of u + c, and hence induces the same preferences as G. However, the most interesting decompositions are those which properly refine u.

Definition 3 (utility-preserving refinement). Let G, G′ be two GAI-decompositions of the same utility function u. Then G = (u_{Z₁},..., u_{Z_m}) is said to u-refine G′ = (u_{Z′₁},..., u_{Z′_{m′}}) if for every i = 1,..., m there is j ∈ {1,..., m′} with Zᵢ ⊆ Z′ⱼ.

Definition 4 (preference-preserving refinement). Let G, G′ be two GAI-decompositions of utility functions u, u′, respectively. Then G = (u_{Z₁},..., u_{Z_m}) is said to refine G′ = (u_{Z′₁},..., u_{Z′_{m′}}) if ⪰_G and ⪰_{G′} are the same relation and for every i = 1,..., m there is j ∈ {1,..., m′} with Zᵢ ⊆ Z′ⱼ.

For both definitions, the refinement is said to be proper if, moreover, for one relation Zᵢ ⊆ Z′ⱼ as in the definition it holds that Zᵢ ⊊ Z′ⱼ, or for one Z′ⱼ there is no Zᵢ ⊆ Z′ⱼ. Refinement differs from u-refinement because the same preference relation can be represented by several utility functions. We will pay particular attention to the maximally refined GAI-decompositions which represent a given preference relation.

Example 2. Consider the set of Boolean variables X = {X₁, X₂, X₃, X₄}. Let u be the GAI utility defined as the sum of u_{X₁X₂}, u_{X₁X₃} and u_{X₁X₄}, where these sub-utilities are defined by the following tables:

u_{X₁X₂}: x₁x₂ → 9, x₁x̄₂ → 5, x̄₁x₂ → 5, x̄₁x̄₂ → 2
u_{X₁X₃}: x₁x₃ → 8, x₁x̄₃ → 9, x̄₁x₃ → 6, x̄₁x̄₃ → 9
u_{X₁X₄}: x₁x₄ → 5, x₁x̄₄ → 2, x̄₁x₄ → 4, x̄₁x̄₄ → 1

It can easily be checked that the order over χ induced by u is also induced by the utility u′ defined as the sum of u′_{X₁X₂} and u′_{X₂X₃X₄} with the following tables:

u′_{X₁X₂}: x₁x₂ → 6, x₁x̄₂ → 2, x̄₁x₂ → 1, x̄₁x̄₂ → 0
u′_{X₂X₃X₄}: x₂x₃x₄ → 3, x₂x₃x̄₄ → 2, x₂x̄₃x₄ → 0, x₂x̄₃x̄₄ → 0, x̄₂x₃x₄ → 7, x̄₂x₃x̄₄ → 4, x̄₂x̄₃x₄ → 1, x̄₂x̄₃x̄₄ → 1

Using a small program based on the ideas developed in the next section, we have checked that neither of these two GAI-decompositions of the same pre-order can be refined. This example shows that there is not always a unique maximally refined GAI-decomposition inducing a given pre-order.

3 REPRESENTATION OF EXAMPLES

Our aim in this paper is to learn GAI-decompositions which induce a hidden target preference relation. Hence, in the following we assume that there is a set of Boolean variables X = {X₁,..., Xₙ}, which defines a set of objects χ, and a target preference relation ⪰ on χ, hidden from the learner. The learner has access to information about ⪰ through examples.

Definition 5 (example). An example e of ⪰ is a triple of the form (o, R, o′), where o, o′ ∈ χ and R is one of ≻, ⪰, ∼, ⪯, ≺. For a set of examples E, we write O_E for the set of all objects involved in at least one example of E.

Examples formalize the information received by the learner, especially by observing the user. For instance, if the learner observes that the user always chooses o over o′, it may represent this as the example (o, ≻, o′). Similarly, if the user sometimes chooses o over o′, and sometimes o′ over o, this may be represented as the example (o, ∼, o′), etc.

Definition 6 (consistency). A GAI-decomposition G of a utility function u is said to be consistent with a set of examples E if for every example (o, R, o′) ∈ E, u(o) > u(o′) (respectively u(o) ≥ u(o′), u(o) = u(o′),...) holds if R is the relation ≻ (respectively ⪰, ∼,...).

Clearly, given a constant k, there is not always a GAI-decomposition of degree k (or less) which is consistent with a given set of examples E. To formalize this, we define a set of examples E to be k-sound if there is at least one utility function u and a decomposition G of u, of degree at most k, that is consistent with E. We now define a system of linear inequalities whose solutions essentially correspond to the GAI-decompositions of degree k consistent with E.

Definition 7 (linear representation of an example). Let e = (o, R, o′) be an example of the target preference relation, and let k ∈ N. Moreover, let σ > 0 be a real constant, positive but arbitrary. Finally, for all subsets of variables Z ⊆ X with 0 < |Z| ≤ k and assignments z to Z, let U_{Z,z} be a formal variable. The linear inequality for e = (o, R, o′), k, σ, written ineq^σ_k(e) (or simply ineq_k(e)), is defined to be

$$\sum_{Z \subseteq X,\, 0 < |Z| \le k} U_{Z,o[Z]} \;\ge\; \sigma + \sum_{Z \subseteq X,\, 0 < |Z| \le k} U_{Z,o'[Z]}$$

if R is the relation ≻, to be

$$\sum_{Z \subseteq X,\, 0 < |Z| \le k} U_{Z,o[Z]} \;\ge\; \sum_{Z \subseteq X,\, 0 < |Z| \le k} U_{Z,o'[Z]}$$

if R is the relation ⪰, and similarly for the relations ∼ (using = in ineq_k(e)), ⪯ (using ≤), and ≺ (using ≤ and σ).

Definition 8 (linear system). Let E be a set of examples of the target preference relation, and let k ∈ N, σ > 0. The linear system for E, k, σ is defined to be the conjunction of linear inequalities $\Sigma^\sigma_k(E) = \bigwedge_{e \in E} \mathrm{ineq}^\sigma_k(e)$ (also written simply Σ_k(E)).

Intuitively, the variables U_{Z,z} encode the components of the GAI-tables in a decomposition G of the target relation. We use a constant σ for strict preference with the aim of using linear programming, for which we need a closed topological space. Proposition 1 below shows that this is without loss of generality. Importantly, note that the system Σ_k(E) has at most $\sum_{i=0}^{k} 2^i \binom{n}{i}$ variables (as many as possible assignments to subsets of at most k variables). However, another bound is obtained by observing that the variable U_{Z,z} appears only if there is an object o ∈ O_E with o[Z] = z. Hence the number of variables occurring in Σ_k(E) is at most $\sum_{i=0}^{k} \binom{n}{i} \cdot |O_E|$. Whichever bound we use, provided k is bounded by a constant, the size of Σ_k(E) is polynomial in the number of variables n and the number of examples |E| (using |O_E| ≤ 2|E|).
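A sketch (our encoding, not the paper's code) of Definition 7: each side of ineq_k(e) is just a collection of variables U_{Z,o[Z]}, one per scope Z with 0 < |Z| ≤ k. The relations ∼, ⪯, and ≺ are handled symmetrically and are omitted here for brevity.

```python
from itertools import combinations

def side_terms(obj, variables, k):
    """All variables U_{Z, o[Z]} appearing on obj's side of the inequality."""
    return [(Z, tuple(obj[v] for v in Z))
            for size in range(1, k + 1)
            for Z in combinations(variables, size)]

def ineq_k(example, variables, k, sigma=0.1):
    """Returns (lhs_vars, rhs_vars, margin), read as
    sum(lhs) >= sum(rhs) + margin; '>' encodes strict preference."""
    o, R, o2 = example
    margin = sigma if R == ">" else 0.0
    return side_terms(o, variables, k), side_terms(o2, variables, k), margin

# Example 3 below: o = x1 x2 ~x3 strictly preferred to o' = x1 ~x2 x3, k = 2.
o = {"X1": 1, "X2": 1, "X3": 0}
o2 = {"X1": 1, "X2": 0, "X3": 1}
lhs, rhs, margin = ineq_k((o, ">", o2), ["X1", "X2", "X3"], k=2)
```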
Example 3. Let o = x₁x₂x̄₃ and o′ = x₁x̄₂x₃. The linear inequality associated with the example e = (o, ≻, o′) for k = 2 and σ = 0.1 is (writing, for example, U_{x₁x̄₂} for U_{{X₁,X₂},x₁x̄₂}):

$$U_{x_1} + U_{x_2} + U_{\bar{x}_3} + U_{x_1 x_2} + U_{x_1 \bar{x}_3} + U_{x_2 \bar{x}_3} \;\ge\; U_{x_1} + U_{\bar{x}_2} + U_{x_3} + U_{x_1 \bar{x}_2} + U_{x_1 x_3} + U_{\bar{x}_2 x_3} + 0.1$$

We now show that the linear system Σ_k(E) characterizes the GAI-decompositions of degree at most k that are consistent with E. For technical reasons, we restrict ourselves to utility functions u with span at least σ, that is, satisfying |u(o) − u(o′)| ≥ σ for all o, o′ with u(o) ≠ u(o′). This is without loss of generality, however, since if u has span σ_u < σ, then u′, defined by u′(o) = (σ/σ_u) u(o) for all o ∈ χ, is consistent with E as well and has span σ.

Proposition 1. Let ⪰ be a preference relation on χ, let E be a set of examples for ⪰, and let k ∈ N, σ > 0. Then the GAI-decompositions of degree at most k, span at least σ, and consistent with E are exactly the solutions of Σ_k(E).

4 LEARNING

In this section we give an algorithm which, given a constant k ∈ N and a k-sound, hidden target preference relation ⪰, learns a GAI-decomposition G of ⪰ from examples only, in the Probably Approximately Correct (PAC) framework [17] (see Section 4.2). Our algorithm essentially maintains the version space of all GAI-decompositions of degree k (and span at least σ) consistent with the examples received so far, using a compact representation by Σ_k(E).

4.1 VC-Dimension

So as to study the number of examples needed to learn, we first study the Vapnik-Chervonenkis dimension (VC-dimension for short) of the class G_k of all relations which can be represented by a GAI-decomposition of degree at most k. The VC-dimension concerns classes of binary concepts, that is, concepts c which assign one of two values to any object x (values c(x) and ¬c(x)). Hence we view ⪰ as the two binary concepts ⪰ and ≻ over objects (o, o′) ∈ χ × χ (and G^⪰_k, G^≻_k denote the corresponding classes of concepts). This gives an equivalent view since, for instance, o ∼ o′ is equivalent to o ⪰ o′ and o′ ⪰ o, and o ≺ o′ is equivalent to o′ ≻ o, etc. Intuitively, the VC-dimension of ⪰ is the largest number of independent couples (o, R, o′), in the sense that the relation R of none of them depends on the relations of the others.

Definition 9 (VC-dimension). Let C be a set of binary concepts over χ × χ. A set of couples O ⊆ χ × χ is said to be shattered by C if for any partition {O⁺, O⁻} of O there is a concept c ∈ C satisfying c(o, o′) for all (o, o′) ∈ O⁺ and ¬c(o, o′) for all (o, o′) ∈ O⁻. The VC-dimension of C is the size of the largest set O that is shattered by C.

We now give the VC-dimension of the classes G^⪰_k and G^≻_k. The fact that it is polynomial could not be taken for granted, even for constant k, since a priori arbitrary values can occur in each entry of the GAI-tables.

Proposition 2. The VC-dimension of G^⪰_k (resp. G^≻_k) is in O(2^k n^{k+1}), where n denotes the number of variables over which the objects are defined.

Proof. Let $K = \sum_{i=0}^{k} 2^i \binom{n}{i}$ (hence K ∈ O(2^k n^{k+1})). We show that no set O ⊆ χ × χ containing more than K couples (o, o′) is shattered, from which the claim will follow. By duality, we give the proof for G^≻_k. So let O ⊆ χ × χ with |O| ≥ K + 1. For all couples (o, o′) ∈ O we define the following formal sum, with variables U_{Z,z} as in Definition 7:

$$V^k_{o,o'} = \sum_{Z \subseteq X,\, 0 < |Z| \le k} U_{Z,o[Z]} - U_{Z,o'[Z]},$$

which corresponds to combining the right- and left-hand sides of any linear inequality associated with o and o′. All sums V^k_{o,o′} (for (o, o′) ∈ O) use variables among the same K variables U_{Z,z} (0 < |Z| ≤ k). Hence, if O contains at least K + 1 couples, there is at least one of them, which we write (ω, ω′), such that the sum V^k_{ω,ω′} is a linear combination of the others, that is, there are values λ_{o,o′} (for (o, o′) ∈ O \ {(ω, ω′)}) which satisfy

$$V^k_{\omega,\omega'} = \sum_{(o,o') \in O \setminus \{(\omega,\omega')\}} \lambda_{o,o'} V^k_{o,o'}.$$

We write O_≤ (resp. O_>) for the set of all couples (o, o′) ∈ O \ {(ω, ω′)} with λ_{o,o′} ≤ 0 (resp. λ_{o,o′} > 0).

First assume O_> ≠ ∅. We show that no concept in G^≻_k is consistent with the partition defined by O⁺ = O_> and O⁻ = O_≤ ∪ {(ω, ω′)}, that is, no concept satisfies o ≻ o′ for all (o, o′) ∈ O_>, o ⊁ o′ (i.e., o′ ⪰ o) for all (o, o′) ∈ O_≤, and ω ⊁ ω′. Indeed, given those labels and using Proposition 1, we get that the following linear system must be satisfied (for an arbitrary constant σ > 0):

V^k_{o,o′} ≥ σ for all (o, o′) ∈ O_>, and V^k_{o,o′} ≤ 0 for all (o, o′) ∈ O_≤.

Because of the signs of the λ_{o,o′}, it follows that the following inequalities must be satisfied:

λ_{o,o′} V^k_{o,o′} ≥ λ_{o,o′} σ for all (o, o′) ∈ O_>, and λ_{o,o′} V^k_{o,o′} ≥ 0 for all (o, o′) ∈ O_≤.

Hence all solutions of this system must satisfy

$$\sum_{(o,o') \in O_> \cup O_\le} \lambda_{o,o'} V^k_{o,o'} \;\ge\; \sigma \sum_{(o,o') \in O_>} \lambda_{o,o'},$$

that is (using O_> ∪ O_≤ = O \ {(ω, ω′)}),

$$V^k_{\omega,\omega'} \;\ge\; \sigma \sum_{(o,o') \in O_>} \lambda_{o,o'}.$$

Because σ > 0, O_> ≠ ∅ and λ_{o,o′} > 0 for (o, o′) ∈ O_>, it follows that ω ⊁ ω′ is impossible, so O is not shattered by G^≻_k.

Now assume O_> = ∅. We show that no concept in G^≻_k is consistent with the partition defined by O⁺ = O_≤ ∪ {(ω, ω′)} and O⁻ = ∅. Indeed, reasoning as above we get V^k_{o,o′} ≥ 0 and λ_{o,o′} V^k_{o,o′} ≤ 0 for all (o, o′) ∈ O_≤, and hence

$$V^k_{\omega,\omega'} = \sum_{(o,o') \in O_\le} \lambda_{o,o'} V^k_{o,o'} \;\le\; 0,$$

so that ω ≻ ω′ is impossible, showing again that O is not shattered by G^≻_k. Since O was arbitrary of size at least K + 1, we conclude that the VC-dimension of G^≻_k is at most K.

4.2 Algorithm

We now give an algorithm for learning a GAI-decomposition of a hidden preference relation accessed through examples. We use the probably approximately correct learning (PAC learning) framework proposed by Valiant [17]: the learner asks for a number m of examples (o, R, o′) of the target relation, and computes a preference relation. The number m of examples in the sample is chosen by the learner as a function of the number of variables n and two real parameters ε, δ ∈ ]0, 1[. Each example is drawn at random according to a probability distribution D on χ × χ; D is fixed but unknown to the learner. For any couple (o, o′) drawn from χ × χ, the learner receives the example (o, R, o′), where R is determined by ⪰ (without noise). In this context, an algorithm is a PAC-learner if it outputs a concept which, with probability at least 1 − δ, has error less than ε on couples drawn from χ × χ according to D.
Formally, D({(o, o′) | o R o′ but not o R̂ o′}) < ε must hold with probability at least 1 − δ, where R is any relation in {≻, ⪰, ∼} and R̂ is the corresponding relation of the output concept; moreover, the number m of examples asked by the learner must be polynomial in n, 1/ε, 1/δ, and the algorithm must run in time polynomial in n, 1/ε, 1/δ (counting unit time for asking and receiving an example). A concept class is said to be efficiently PAC-learnable if such a PAC-learner exists for the class. The framework of PAC-learning captures situations where the learner observes some objects in its environment (those that come to it; it cannot choose which), and is given by a teacher the correct labels for these objects. Since some objects possibly occur more seldom than others (as formalized by the distribution D), the learner has fewer chances to learn from them, but it is also less penalized by errors on them.

In order to show that, for fixed, constant k, the class G_k of GAI-decompositions of degree at most k is PAC-learnable, we follow the classical consistent-learning approach. The learner maintains a concept (in fact, the version space of all concepts) consistent with each of the examples received so far, namely, it maintains the linear system Σ_k(E) (for a fixed but arbitrary span σ > 0). Algorithm 1 depicts the procedure. Since ⪰ is assumed to be k-sound and our setting is noise-free, the algorithm always returns a solution, i.e., Σ_k(E) is necessarily a consistent system. The number m of examples which the learner needs is given by the following proposition.

Proposition 3. For any constant k > 0, the class G_k of GAI-decompositions of degree at most k is efficiently PAC-learnable. The number m of examples required by Algorithm GAI-Learning is in

$$O\!\left(\max\!\left(\frac{1}{\varepsilon} \log \frac{1}{\delta},\; \frac{2^k n^{k+1}}{\varepsilon} \log \frac{1}{\varepsilon}\right)\right).$$
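Up to the constants hidden in the O(·), the bound of Proposition 3 can be turned into a rough sample-size calculator; the toy sketch below is ours, and the constant C is an assumption, not derived from [3].

```python
import math

def sample_size(n, k, eps, delta, C=1.0):
    """m in O(max((1/eps) log(1/delta), (d/eps) log(1/eps))), using the
    VC-dimension bound d = 2^k * n^(k+1) of Proposition 2."""
    d = (2 ** k) * n ** (k + 1)
    return math.ceil(C * max((1 / eps) * math.log(1 / delta),
                             (d / eps) * math.log(1 / eps)))

print(sample_size(n=10, k=2, eps=0.1, delta=0.05))
```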

Algorithm 1: The GAI-Learning Algorithm

begin
    Σ_k(E) ← ∅;
    for i = 1,..., m do
        ask for an example e of the form (o, R, o′);
        add ineq_k(e) to Σ_k(E);
    compute a solution Sol of Σ_k(E);
    return the GAI-decomposition encoded by Sol;
end

Proof. Proposition 1 shows that Σ_k(E) is solvable, and that the concept returned by the algorithm is consistent with all the examples received. We determine m using the VC-dimensions of G^⪰_k and G^≻_k. Following [2], because the binary relations ⪰ and ≻ uniquely determine the concept (see the discussion before Definition 9), we define the dimension d of the class of non-binary concepts G_k to be the maximum of the VC-dimensions of G^⪰_k and G^≻_k, hence d ∈ O(2^k n^{k+1}); intuitively, a learner learns ⪰ and ≻ in parallel, using the same examples for learning both, and deduces the concept (for more details we refer the reader to [2]). Then we can apply the well-known generic result of Blumer et al. [3, Theorem 2.1 (ii)] to get that a number of examples m ∈ O(max((1/ε) log(1/δ), (d/ε) log(1/ε))) is enough for any concept consistent with the examples received to be probably approximately correct. Finally, since m is polynomial in n, 1/ε, 1/δ (k is bounded), the size of Σ_k(E) is polynomial. Since linear programming is polynomial, the proof is complete.

5 MINIMIZING GAI-DECOMPOSITIONS

So far we have shown that, for any constant k (small in practice), the class G_k of GAI-decompositions with degree at most k is PAC-learnable. However, our solution does not distinguish between a decomposition of degree k and one of degree k′ ≤ k, nor does it distinguish between one with t clusters of variables Zᵢ and one with t′ ≤ t clusters, etc. We now briefly discuss how such parameters can be optimized. The first natural objective is to learn a GAI-decomposition which is maximally u-refined, that is, which cannot be decomposed further while preserving the function.

Lemma 1. Let u_Z be a utility function with a nontrivial GAI-decomposition (u_{Z₁},..., u_{Z_m}). Then

$$\sum_{z \in D(Z)} u_Z(z) \;>\; \sum_i \sum_{z_i \in D(Z_i)} u_{Z_i}(z_i)$$

holds.

Proof. By definition of a decomposition we have $u_Z(z) = \sum_i u_{Z_i}(z[Z_i])$, and hence

$$\sum_{z \in D(Z)} u_Z(z) = \sum_i \sum_{z \in D(Z)} u_{Z_i}(z[Z_i]).$$

Now, because there are $2^{|Z| - |Z_i|}$ assignments to Z which match $z_i$ on $Z_i$, it follows that

$$\sum_{z \in D(Z)} u_Z(z) = \sum_i 2^{|Z| - |Z_i|} \sum_{z_i \in D(Z_i)} u_{Z_i}(z_i).$$

Finally, because $Z_i \subseteq Z$ we have $2^{|Z| - |Z_i|} \ge 1$ for all i, and because the GAI-decomposition is not trivial we have $2^{|Z| - |Z_i|} > 1$ for at least one i, and hence

$$\sum_{z \in D(Z)} u_Z(z) \;>\; \sum_i \sum_{z_i \in D(Z_i)} u_{Z_i}(z_i),$$

as desired.

Corollary 1. Let E be a k-sound set of examples. Then minimizing the objective function $\sum_{0 < |Z| \le k} \sum_{z \in D(Z)} u_Z(z)$, under the constraints in Σ_k(E) plus the constraints U_{Z,z} ≥ 0 (for all Z, z), yields a GAI-decomposition (u_{Z₁},..., u_{Z_m}) which is consistent with E and in which no u_{Zᵢ} can be u-refined.

Proof. Observe that, due to the nonnegativity constraints on the U_{Z,z}, the minimum is well-defined. Now, towards a contradiction, let G = (u_{Z₁},..., u_{Z_m}) be an optimal solution, and let (w.l.o.g.) G₁ = (u_{Z₁₁},..., u_{Z₁ₚ}) (Z₁ᵢ ⊊ Z₁) be a nontrivial u-refinement of u_{Z₁}. Define G′ to be obtained from G by replacing u_{Z₁} with G₁. Then clearly G′ is a utility-preserving refinement of G, hence both represent the same utility function. It follows that G′ is also consistent with E and also a feasible solution of the program (in particular, it has the same span σ). Now, by Lemma 1, G′ has a better value than G, which contradicts the optimality of G.
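The minimization of Corollary 1 is an ordinary linear program. Here is a hedged sketch of solving it with scipy.optimize.linprog (our encoding; the paper does not commit to a particular solver). Each example contributes one row, and table entries that appear on both sides of an inequality cancel out.

```python
import numpy as np
from scipy.optimize import linprog

def solve_p1(inequalities, var_index, sigma=1.0):
    """inequalities: list of (lhs_vars, rhs_vars, strict) as produced by an
    ineq_k-style encoder; var_index maps each (Z, z) key to a column index."""
    n = len(var_index)
    A_ub, b_ub = [], []
    for lhs, rhs, strict in inequalities:
        row = np.zeros(n)
        for key in lhs:
            row[var_index[key]] -= 1.0   # linprog enforces A_ub @ x <= b_ub
        for key in rhs:
            row[var_index[key]] += 1.0
        A_ub.append(row)
        b_ub.append(-sigma if strict else 0.0)
    c = np.ones(n)                        # objective: sum of all U_{Z,z}
    return linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                   bounds=[(0, None)] * n, method="highs")
```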
Let D(Z, E) denote the set of z ∈ D(Z) such that there is some o ∈ O_E with z = o[Z]: obviously, for every z ∈ D(Z), if z ∉ D(Z, E) then the minimization will yield U_{Z,z} = 0. Therefore, the linear program of Corollary 1 can be expressed as follows:

(P1) minimize $\sum_{Z \subseteq X,\, 0 < |Z| \le k,\, z \in D(Z,E)} U_{Z,z}$
under constraints: ineq_k(e) for every e ∈ E; U_{Z,o[Z]} ≥ 0 for every Z, o.

However, though (P1) achieves some kind of minimality, it does not tend to minimize the number of nonempty entries (nonnull values) in the tables, as the following example shows.

Example 4. Consider two Boolean variables X₁, X₂, and let E = {x₁x₂ ≻ x₁x̄₂, x₁x̄₂ ≻ x̄₁x₂, x̄₁x₂ ≻ x̄₁x̄₂}. Fix k = 2, σ = 1, and consider the following decompositions (u₁, u₁₂) and (u′₁₂), which are both consistent with E:

u₁: x₁ → 1, x̄₁ → 0
u₁₂: x₁x₂ → 2, x₁x̄₂ → 1, x̄₁x₂ → 1, x̄₁x̄₂ → 0
u′₁₂: x₁x₂ → 3, x₁x̄₂ → 2, x̄₁x₂ → 1, x̄₁x̄₂ → 0

The decomposition (u′₁₂) has fewer non-zero entries than (u₁, u₁₂); however, it can be seen that the latter has a better value for (P1) than the former. Despite this, we can apply some efficient post-processing to the solution computed for (P1), by transferring the values in the table of Zᵢ to the table of some Zⱼ ⊇ Zᵢ, if any. Clearly, minimizing the number of nonempty entries allows for more efficient storage of the learnt decomposition. In order to minimize this number over all possible GAI-decompositions consistent with the examples, one has to resort to mixed-integer programming, using an additional set of 0/1 variables of the form V_{Z,z} (recording whether the entry u_Z(z) is nonempty). This yields the following program (using standard constructs):

(P2) minimize $\sum_{Z \subseteq X,\, 0 < |Z| \le k,\, z \in D(Z,E)} V_{Z,z}$
under constraints: ineq_k(e) for every e ∈ E; U_{Z,o[Z]} ≥ 0 for every Z, o; V_{Z,o[Z]} ≥ U_{Z,o[Z]} for every Z ⊆ X, 0 < |Z| ≤ k, o ∈ O_E; V_{Z,o[Z]} ∈ {0, 1} for every Z ⊆ X, 0 < |Z| ≤ k, o ∈ O_E.

Note that the constant σ must be small enough so that constraining the U_{Z,o[Z]} to lie in the [0, 1] interval does not artificially eliminate some solutions.

Finally, a natural objective is to minimize the degree of the learnt GAI-decomposition (given that it will be at most k anyway).

Example 5. Consider three Boolean variables X₁, X₂, X₃, let k = 3, σ = 1, and E = {x₁x₂x₃ ≻ x̄₁x₂x₃, x₁x₂x̄₃ ≻ x̄₁x̄₂x̄₃}. Consider the decompositions (u₁₂₃) and (u₁, u₂) defined as follows:

u₁₂₃: x₁x₂x₃ → 1, x₁x₂x̄₃ → 1, all other assignments → 0
u₁: x₁ → 1, x̄₁ → 0
u₂: x₂ → 1, x̄₂ → 0

Both decompositions are consistent with E. Moreover, (u₁₂₃) is clearly an optimum of (P1) and of (P2), but it does not have minimal degree, since (u₁, u₂) has degree 1.

Again, one could resort to mixed-integer programming to minimize the degree over all decompositions consistent with the examples. Nevertheless, it is clearly more efficient to proceed by exhaustive search (or by dichotomy): if there is a decomposition of degree k, then look for one of degree k − 1, etc.

6 CONCLUSION

In this paper, we have shown that any complete preorder on a combinatorial domain which can be represented by a GAI-decomposition of degree k can also be seen as a solution of a system of linear inequalities. For a given k, a set of examples of such a hidden preorder (encoding a preference relation) leads to a linear system whose variables encode the components of the GAI-tables. A judicious choice of objective function allows one to obtain a minimal decomposition (with various notions of minimality).

With this in hand, we designed an algorithm which learns a GAI-decomposition of a hidden preference relation, provided a constant bound on the degree of such a decomposition is known a priori, in the framework of PAC-learning. To our knowledge, this is the first algorithm able to learn GAI-decompositions without being given the structure of the tables (the sets of variables Zᵢ). Even though we require a constant bound on the degree to be given, the result could not be taken for granted, since the presence of arbitrary values in the GAI-tables makes GAI-decompositions of degree k a priori a very expressive class (hence difficult to learn). On the practical side, requiring a small, constant bound typically fits the applicative context, where (human) users usually have very local preferences.

When the training set is noisy or the chosen bound k is too low, the linear system has no solution. It is simple to relax the system by introducing a slack variable δ_e in the left part of each inequality ineq_k(e). The sum of the δ_e is then obviously to be minimized (the variant of the algorithm given in the paper corresponds to null δ_e).

The next development of this work is the design of a good strategy for choosing the degree of the GAI-decomposition to be learnt. A simple approach is to follow an increasing strategy: first assume that the variables are independent, then set k = 2, and so on, until a good coverage of the examples is reached. A finer strategy would be to use a tolerance parameter and to analyze the (imperfect) graph learnt for k = 2 in order to get a better idea of the dependencies, the knowledge of a clique on a set Y of variables leading to the necessity of the local utility function u_Y.
More generally, such further developments would imply interleaving two procedures, one devoted to learning k (or the structure of the GAI-decomposition), and the other to learning the entries in the tables using linear programming. Last but not least, we shall go back to the development of (active) elicitation methods close to the ones used in CP-net learning. The idea is to ask the user a series of questions whose answers allow the learner to infer (in)dependencies between variables [13]. In this context, the equations and variables of the linear system would show up only when necessary, leading to a much more efficient procedure in terms of memory.

Acknowledgments

We thank all anonymous reviewers of ECAI 2012 for helpful comments.

REFERENCES

[1] Fahiem Bacchus and Adam Grove, Graphical models for preference and utility, in Proceedings of UAI'95, pp. 3-10, (1995).
[2] Shai Ben-David, Nicolò Cesa-Bianchi, David Haussler, and Philip M. Long, Characterizations of learnability for classes of {0,...,n}-valued functions, J. of Computer and System Sciences, 50, 74-86, (1995).
[3] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth, Learnability and the Vapnik-Chervonenkis dimension, J. ACM, 36(4), (1989).
[4] Craig Boutilier, Fahiem Bacchus, and Ronen I. Brafman, UCP-networks: A directed graphical representation of conditional utilities, in Proceedings of UAI'01, (2001).
[5] Craig Boutilier, Ronen Brafman, Holger Hoos, and David Poole, Reasoning with conditional ceteris paribus preference statements, in Proceedings of UAI-99, San Francisco, CA, (1999).
[6] Craig Boutilier, Ronen I. Brafman, Carmel Domshlak, Holger H. Hoos, and David Poole, CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements, JAIR, 21, (2004).
[7] G. Debreu, Topological methods in cardinal utility theory, in Mathematical Methods in the Social Sciences, eds. K.J. Arrow, S. Karlin, and P. Suppes, 16-26, Stanford University Press, Stanford, (1960).
[8] Carmel Domshlak and Thorsten Joachims, Unstructuring user preferences: Efficient non-parametric utility revelation, in Proceedings of UAI'05, (2005).
[9] P.C. Fishburn, Utility Theory for Decision Making, Wiley, New York, (1970).
[10] Yoav Freund, Raj D. Iyer, Robert E. Schapire, and Yoram Singer, An efficient boosting algorithm for combining preferences, Journal of Machine Learning Research, 4, (2003).
[11] Christophe Gonzales and Patrice Perny, GAI networks for utility elicitation, in Proceedings of KR'04, (2004).
[12] Thorsten Joachims, Optimizing search engines using clickthrough data, in Proceedings of ACM KDD'02, (2002).
[13] Frédéric Koriche and Bruno Zanuttini, Learning conditional preference networks with queries, in Proceedings of IJCAI'09, (2009).
[14] Daniel Mailharro, A classification and constraint-based framework for configuration, Artificial Intelligence for Engineering Design, Analysis and Manufacturing, (September 1998).
[15] Anand S. Rao and Michael P. Georgeff, Modeling rational agents within a BDI-architecture, in Proceedings of KR'92, (1991).
[16] Thomas Schiex, Hélène Fargier, and Gérard Verfaillie, Valued constraint satisfaction problems: Hard and easy problems, in Proceedings of IJCAI'95, Morgan Kaufmann, (1995).
[17] Leslie G. Valiant, A theory of the learnable, Communications of the ACM, 27(11), (1984).
[18] Paolo Viappiani, Boi Faltings, and Pearl Pu, Preference-based search using example-critiquing with suggestions, JAIR, 27, (2006).

Learning Conditional Lexicographic Preference Trees

Michael Bräuning and Eyke Hüllermeier 1

1 Department of Mathematics and Computer Science, University of Marburg, Germany, {braeunim,eyke}@mathematik.uni-marburg.de

Abstract. We introduce a generalization of lexicographic orders and argue that this generalization constitutes an interesting model class for preference learning in general and ranking in particular. We propose a learning algorithm for inducing a so-called conditional lexicographic preference tree from a given set of training data in the form of pairwise comparisons between objects. Experimentally, we validate our algorithm in the setting of multipartite ranking.

1 INTRODUCTION

Preference learning is an emerging subfield of machine learning that has received increasing attention in recent years [13]. A specific though important special case of preference learning is learning to rank, that is, the learning of models that can be used to predict preferences in the form of rankings of a set of alternatives [7, 8]. Ranking problems are often reduced to problems of a simpler type, such as learning a value function that assigns scores to alternatives (with better alternatives having higher scores) or learning a binary predicate that compares pairs of alternatives [15]; while the former approach is close to regression, the latter is in the realm of classification learning.

Another approach to learning ranking functions is to proceed from specific model assumptions, that is, assumptions about the structure of the sought preference relations. This approach is less generic than the previous ones, as it strongly depends on the concrete assumptions made. On the other hand, it typically offers the advantage of being more easily understandable and interpretable. An example is the representation of preferences in the form of a CP-net [5]. Another example is lexicographic orders, which are widely accepted as a plausible representation of (human) preferences [16], especially in complex decision making domains [1]. Here, the assumption is that the target ranking of a set of alternatives, each one described in terms of multiple attributes, can be represented as a lexicographic order.

From a machine learning point of view, assumptions of the above type can be seen as an inductive bias restricting the hypothesis space. Provided the bias is correct, this is clearly an advantage, as it may simplify the learning problem. On the other hand, an overly strong bias may prevent the learner from approximating the target ranking sufficiently well. For example, while being plausible in some situations, the assumption of a lexicographic order will be too restrictive for many applications.

In this paper, we therefore present a method for learning generalized lexicographic orders. While still being simple and easy to understand, the model class we consider relaxes some of the assumptions of a proper lexicographic order. More specifically, we increase flexibility thanks to two extensions of conventional lexicographic orders. First, we allow for conditioning [3, 4]: The importance of attributes as well as the preferences for the values of an attribute may depend on the values of other variables preceding that one in the underlying variable order. Second, we allow for grouping [17]: Several (one-dimensional) variables can be grouped into a single high-dimensional variable, and preferences can be specified on the Cartesian product of the corresponding domains. The remainder of this paper is organized as follows.
In the next section, we give a brief overview of related work. In Section 3, we introduce generalized lexicographic orders and the notion of conditional lexicographic preference trees. In Section 4, we present an algorithm for learning such preference models from data. An experimental study is presented in Section 5, prior to concluding the paper in Section 6.

2 RELATED WORK

The use of lexicographic orders in preference modeling has already been considered in the seventies of the last century [10], whereas in machine learning, this type of structure has attracted attention only recently. Flach and Matsubara developed a lexicographic ranker called LexRank, using a linear preference ordering on attributes derived by the odds ratio [12, 11]. Experimentally, they show that LexRank is competitive with decision trees and naive Bayes in terms of ranking performance (AUC). Further work on learning lexicographic orders was done by Schmitt and Martignon [16], Dombi et al. [9], and Yaman et al. [18]. However, these works are based on rather simplistic assumptions. More general models were studied by Booth et al. [3, 4], and in fact, important parts of our approach (such as conditional importance of attributes and conditional preferences on attribute values) are inspired by these models. Their work remains rather theoretical, however, without a practical realization in terms of an implementation of algorithms or an experimental study with real data.

3 GENERALIZED LEXICOGRAPHIC ORDERS

Formally, we proceed from an attribute-value representation of decision alternatives or objects, i.e., an object is represented as a vector $o \in O = D(V) = D(A_1) \times \dots \times D(A_n)$, where $V = \{A_1, \dots, A_n\}$ is the set of attributes (variables) and $D(A_i)$ is the domain of attribute $A_i$. For a subset $A = \{A_{i_1}, \dots, A_{i_k}\} \subseteq V$ of attributes we define $D(A) = D(A_{i_1}) \times \dots \times D(A_{i_k})$. An assignment or instantiation of a subset $A \subseteq V$ of attributes is an element $a \in D(A)$; an assignment is called complete if $A = V$;

otherwise it is called partial. For an object $o \in O$ and a subset $A \subseteq V$, we denote by $o[A]$ the projection of $o$ from $D(V)$ to $D(A)$; if $A = \{A_k\}$ is a single attribute, we also write $o[k]$ instead of $o[\{A_k\}]$.

A lexicographic order on $O$ is a total order $\succ$ defined in terms of a total order $\rhd$ on $V$, i.e., a ranking of the attributes, and a total order $\succ_i$ on each attribute domain $D(A_i)$. More specifically, $o' \succ o$ (suggesting that $o'$ is preferred to $o$) if and only if there exists a $k \in \{1, \dots, n\}$ such that
$$\big( o'[k] \succ_k o[k] \big) \wedge \Big( (A_i \rhd A_k) \Rightarrow \big( o'[i] = o[i] \big) \Big) \text{ for all } i \in \{1, \dots, n\}.$$
The relations $\succ_i$ indicate preference on individual attributes: for $a, b \in D(A_i)$, $a \succ_i b$ means that $a$ is preferred to $b$ as a value for attribute $A_i$. Moreover, the relation $\rhd$ reflects the importance of attributes: $A_i \rhd A_j$ means that attribute $A_i$ is more important than $A_j$, whence the former is considered prior to the latter. Without loss of generality, we shall subsequently assume that $A_1 \rhd A_2 \rhd \dots \rhd A_n$ (unless otherwise stated).

3.1 Conditional preferences on attribute values

Conventional lexicographic orders assume that preferences $\succ_k$ on attribute domains are independent of each other. Needless to say, this assumption is often violated in practice. For example, although it is possible that a person prefers red wine to white wine in general, it is also plausible that her preference for wine may depend on the main dish: red is preferred to white in the case of meat, whereas white is preferred to red in the case of fish. In order to capture attribute dependencies of that type, the preference relations $\succ_k$ can be conditioned on the values of the attributes $A_j$ preceding $A_k$ in the order $\rhd$ [3, 4]. That is, $\succ_k$ is now replaced by a set of strict orders
$$\Big\{ \succ_k^{(a_1, \dots, a_{k-1})} \;\Big|\; (a_1, \dots, a_{k-1}) \in D(\{A_1, \dots, A_{k-1}\}) \Big\}.$$
Moreover, the order relation $\succ$ on $O$ is then defined as follows: $o' \succ o$ for $o = (a_1, \dots, a_n)$ and $o' = (a'_1, \dots, a'_n)$ if and only if there exists a $k \in \{1, \dots, n\}$ such that
$$\big( \forall i \in \{1, \dots, k-1\}: a'_i = a_i \big) \wedge \big( a'_k \succ_k^{(a_1, \dots, a_{k-1})} a_k \big).$$

3.2 Conditional attribute importance

Going one step further, one may assume that the values of the first attributes in the attribute order do not only influence the preferences on the values of the attributes that follow, but also the importance of the attributes themselves [3, 4]. Thus, we are no longer dealing with a lexicographic order in the sense that $\rhd$ defines a sequence of the attributes $V$ according to their importance. Instead, we are dealing with a tree-like structure. This structure is defined by the following (choice) function:
$$A = C\big( (A_{i_1}, A_{i_2}, \dots, A_{i_k}), (a_{i_1}, a_{i_2}, \dots, a_{i_k}) \big),$$
where $(A_{i_1}, A_{i_2}, \dots, A_{i_k}) \in V^k$ is a sequence of attributes (such that $A_{i_j} \neq A_{i_k}$ for $j \neq k$) and $a_{i_j} \in D(A_{i_j})$ for all $j \in \{1, \dots, k\}$. Moreover, $A \in V \setminus \{A_{i_1}, \dots, A_{i_k}\}$ is the most important attribute given that $A_{i_j} = a_{i_j}$ for all $j \in \{1, \dots, k\}$.

3.3 Variable grouping

Another extension consists of grouping several variables, that is, allowing the expression of preferences on attribute tuples instead of single attributes only [17]. Formally, this means selecting an index set $I \subseteq \{1, \dots, n\}$ and defining a total order relation $\succ_I$ on the Cartesian product $D(V_I)$ of the domains $D(A_i)$, $i \in I$. Note that the possibility of variable grouping significantly increases the expressivity of the model class. In particular, taking $I = \{1, \dots, n\}$, it is possible to define every order on $D(V)$, that is, to sort the set of alternatives in any way. Since this level of expressivity is normally not desirable, it is reasonable to restrict to variable grouping of order $g_{\max}$, meaning to impose the constraint $|I| \le g_{\max}$ for a fixed $g_{\max} \le n$.
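To make the basic definition concrete before turning to trees, here is a minimal sketch of the plain (unconditional) lexicographic comparison defined at the start of this section; the dictionary-based object encoding and all names are illustrative assumptions, not the authors' notation.

```python
def lex_compare(o1, o2, attr_order, value_prefs):
    """Compare two objects under a plain lexicographic order.

    attr_order:  attributes from most to least important (the order on V).
    value_prefs: maps each attribute to its domain values, best first.
    Returns 1 if o1 is preferred, -1 if o2 is preferred, 0 if equal.
    """
    for a in attr_order:
        if o1[a] != o2[a]:
            ranking = value_prefs[a]  # e.g. ["red", "white"]
            return 1 if ranking.index(o1[a]) < ranking.index(o2[a]) else -1
    return 0
```

Conditioning (Section 3.1) would replace the single `value_prefs[a]` lookup by a lookup indexed by the values of the more important attributes.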
3.4 Conditional lexicographic preference trees

Combining the generalizations discussed above, we end up with what we call a Conditional Lexicographic Preference Tree (CLPT). Graphically, this is a tree structure in which every node is labeled with a subset of attributes $V_I$ and a total order $\succ_I$ on the Cartesian product $D(V_I)$ of the corresponding attribute domains $D(A_i)$, $i \in I$; there is one outgoing edge (descendant node) for each value $o[V_I] \in D(V_I)$; every attribute $A_i \in V$ occurs at most once on each branch from the root of the tree to a leaf node (i.e., the index sets $I$ along a branch are disjoint). We call a CLPT complete if every attribute $A_i \in V$ occurs exactly once on each branch from the root of the tree to a leaf node (i.e., the index sets $I$ along a branch form a partition of $\{1, \dots, n\}$).

A (complete) CLPT can be thought of as defining an order relation on $O$ through recursive refinement of a weak order, that is, by refining an order relation with tie groups in a recursive manner (in the following, $\sim$ and $\succ$ denote, respectively, the symmetric and asymmetric parts of $\succeq$): One starts with a single equivalence class (tie group), i.e., $o \sim o'$ for all $o, o' \in O$. Let the root of the CLPT be labeled with the attribute set $V_I$, and let $\succ_I$ denote the corresponding order on $D(V_I)$. The current order is then refined by letting $o' \succ o$ whenever $o'[V_I] \succ_I o[V_I]$; otherwise, if $o'[V_I] = o[V_I]$, then $o$ and $o'$ remain tied. Thus, a linear order of tie groups (equivalence classes) is produced. Each equivalence class (represented by a value $a \in D(V_I)$) is then recursively refined by the subtree the objects of this equivalence class are passed to. Note that, if the CLPT is complete, the order relation eventually produced is a total order.

4 LEARNING CLPTs

In this section, we outline a method for inducing a CLPT from training data
$$T = \big\{ (o'_i, o_i) \big\}_{i=1}^{N} \qquad (1)$$
that consists of a set of object pairs $(o'_i, o_i) \in O^2$, suggesting that $o'_i$ is preferred to $o_i$. Roughly speaking, this means finding a CLPT whose induced order relation on $O$ is as much as possible in agreement with the pairwise preferences in $T$ (without overfitting the training data). The induced order relation is a total order if the CLPT is complete.
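For two objects, the recursive refinement just described reduces to a single walk down the tree. The following data structure and comparison are our own illustration (node labels, grouping via tuples of attribute indices, and the assumption that each node's order covers all of $D(V_I)$ are ours, not the authors'):

```python
class Node:
    """One node of a conditional lexicographic preference tree (CLPT)."""
    def __init__(self, attrs, order, children=None):
        self.attrs = attrs               # tuple of attribute indices (the set V_I)
        self.order = order               # value tuples of D(V_I), best first
        self.children = children or {}   # value tuple -> child Node

def clpt_compare(node, o1, o2):
    """Return 1 if o1 is preferred, -1 if o2 is preferred, 0 if tied
    under the (possibly incomplete) CLPT rooted at `node`."""
    while node is not None:
        v1 = tuple(o1[i] for i in node.attrs)
        v2 = tuple(o2[i] for i in node.attrs)
        if v1 != v2:
            return 1 if node.order.index(v1) < node.order.index(v2) else -1
        node = node.children.get(v1)  # descend into the matching subtree
    return 0
```

Incomplete trees leave some pairs tied, which is exactly the weak-order behaviour described above.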

4.1 Performance and evaluation measures

In order to evaluate the predictive performance of a CLPT, there is a need to compare the order relation $\succeq$ (with asymmetric part $\succ$) induced by this model with a ground truth order. As will be seen below, the same measures can be used to fit a CLPT to a given set of training data (1) during the training phase. In this case, the ground truth is not a total order but a set of pairwise comparisons between objects. Since a total order can be decomposed into (a quadratic number of) such comparisons, too, we can assume (without loss of generality) that we compare with a set $T$ of pairs $(o', o) \in O^2$, suggesting that $o'$ should be ranked higher than $o$. Inspired by the corresponding notions introduced in [6], we define two performance measures of correctness and completeness, respectively, as follows:
$$CR(\succ, T) = \frac{|C| - |D|}{|C| + |D|}, \qquad (2)$$
$$CP(\succ, T) = \frac{|C| + |D|}{|T|}, \qquad (3)$$
where
$$C = \big\{ (o', o) \in T \mid o' \succ o \big\}, \qquad D = \big\{ (o', o) \in T \mid o \succ o' \big\}.$$
Note that $CR(\succ, T)$ assumes values between $-1$ (complete disagreement) and $+1$ (complete agreement), while $CP(\succ, T)$ ranges between 0 (no comparisons) and 1 (full comparison).

4.2 A greedy learning procedure

We implement an algorithm for learning a CLPT as a (greedy) search in the space of tree structures, based on the greedy algorithms presented by Schmitt and Martignon [16] as well as Booth et al. [3, 4]. This is done by constructing the tree from the root to the leaves in a recursive manner. In each step of the recursion, a new node is created with an associated subset $V_I$ of attributes, where $|V_I| \le g_{\max}$, and a total order $\succ_I$ on $D(V_I)$.

4.2.1 Creating a node

The problem to be solved in each recursion is the following: Given a set of pairwise comparisons $T$ and a set $V' \subseteq V$ of attributes still available, select a most suitable subset $V_I \subseteq V'$ and an order $\succ_I$. Following a greedy strategy, we choose $(V_I, \succ_I)$ so as to maximize correctness (2), using completeness (3) as a second criterion to break ties. The selection of an attribute subset $V_I$ can be done through exhaustive search if its size is sufficiently limited, i.e., if the upper bound $g_{\max}$ is small. Otherwise, a complete enumeration of all possibilities may become too expensive. Moreover, for each candidate subset $V_I$, a total order $\succ_I$ needs to be determined. Again, all such orders can be tried if $D(V_I)$ is not too large. Otherwise, heuristic ranking procedures such as Borda count can be used (counting the number of wins and losses of each value $a \in D(V_I)$ in the training data $T$ and sorting according to the difference).

4.2.2 Limiting the number of candidate subsets

In order to avoid a complete enumeration of all candidate subsets $V_I$ of size at most $g_{\max}$, we combine a greedy search with a kind of lookahead procedure: We provisionally create a node by selecting a single attribute instead of a subset, i.e., we tentatively set $g_{\max}$ to 1; apart from that, exactly the same selection procedure (as outlined above) is applied. This step is repeated $g_{\max}$ times, thereby producing a subtree of depth $g_{\max}$. Let $V'' \subseteq V'$ denote the subset of attributes that occur in this subtree, i.e., that are chosen in at least one of the nodes. Then, as candidate subsets $V_I$, we only try subsets of $V''$, i.e., subsets $V_I \subseteq V''$ such that $|V_I| \le g_{\max}$. Obviously, the underlying assumption is that an attribute that has not been chosen in any of the $g_{\max}$ steps is not important at this point.
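Since node creation maximizes correctness (2) with completeness (3) as tie-breaker, both measures are worth spelling out in code. A minimal sketch, assuming a comparator with the 1/-1/0 convention of the CLPT sketch above:

```python
def cr_cp(compare, T):
    """Correctness CR and completeness CP of a predicted order against
    pairwise preference data T = [(o_better, o_worse), ...].
    `compare(o1, o2)` returns 1, -1 or 0 (tie / incomparable)."""
    c = sum(1 for o1, o2 in T if compare(o1, o2) == 1)   # concordant pairs
    d = sum(1 for o1, o2 in T if compare(o1, o2) == -1)  # discordant pairs
    cr = (c - d) / (c + d) if c + d else 0.0
    cp = (c + d) / len(T) if T else 0.0
    return cr, cp
```

The guards for empty denominators are our assumption; the paper does not discuss the degenerate cases.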
4.2.3 Recursion

Once an optimal subset $V_I$ has been chosen, the training examples $(o', o)$ with $o'[V_I] \succ_I o[V_I]$ are removed from $T$ (since they are sorted at this node). Moreover, for each value $a \in D(V_I)$, a data set
$$T_a = \big\{ (o', o) \in T \mid o'[V_I] = o[V_I] = a \big\}$$
is created and passed to the corresponding successor node (together with $V' \setminus V_I$ as the attributes that have not been used so far). The same recursive procedure is then applied to each of these successor nodes.

4.2.4 Initialization and termination

The learning procedure is called with the original training set $T$ and the full set $V$ of attributes as candidates. The recursion terminates if no attribute is left ($V' = \emptyset$) or if the set of training examples is empty ($T = \emptyset$).

4.2.5 CLeRa

We call the algorithm outlined above CLeRa, which is short for Conditional Lexicographic Ranker. The CLPT induced by CLeRa can be used to compare new object pairs $(o', o) \in O^2$. To this end, the pair is submitted to the root and propagated through the tree until either a leaf node is reached or a node at which $o'[V_I] \neq o[V_I]$; in this case, $o' \succ o$ is decided if $o'[V_I] \succ_I o[V_I]$, and $o \succ o'$ if $o[V_I] \succ_I o'[V_I]$. Otherwise, if $o'[V_I] = o[V_I]$ in all nodes traversed by the two objects, then $o \sim o'$. Given not only a pair but a complete set of objects to be ranked, the pairwise comparison realized by the CLPT can be embedded in any standard sorting algorithm, such as insertion sort. Note that, since $o \sim o'$ is possible in a pairwise comparison, the result of the sorting procedure will in general only be a weak order.
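As a usage illustration of embedding the pairwise comparison into a standard sorting algorithm, the comparator from the CLPT sketch above can be handed to Python's sort via `cmp_to_key`; the toy tree and objects below are assumptions:

```python
from functools import cmp_to_key

# reuses the Node / clpt_compare sketch from Section 3.4 above
root = Node(attrs=(0,), order=[(1,), (0,)])   # toy tree: prefer A_1 = 1
objects = [(0, 1), (1, 0), (1, 1)]
ranked = sorted(objects, key=cmp_to_key(lambda a, b: clpt_compare(root, b, a)))
# -> [(1, 0), (1, 1), (0, 1)]; the two tied objects keep their input order
```

A stable sort linearizes the weak order arbitrarily within tie groups, which is consistent with the remark above.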

5 EXPERIMENTAL RESULTS

We evaluate our approach on 15 benchmark data sets from the Statlog and the UCI repository [2]. These data sets, which define binary or ordinal classification problems, were pre-processed as follows: numerical attributes and attributes with more than five values were discretized into four values using equal frequency binning. Moreover, instances with missing values were neglected.

The learning problem we consider is multipartite ranking [14]: Given a set of test instances $X \subseteq O$, the goal is to predict a ranking that agrees with the (ordered) class labels of these instances. Formally, this agreement is measured in terms of the so-called C-index, which can be seen as an extension of the AUC:
$$C = \frac{1}{\sum_{i<j} n_i n_j} \sum_{1 \le i < j \le m} \;\sum_{(o, o') \in X_i \times X_j} I(o' \succ o) + \frac{1}{2}\, I(o' \sim o),$$
where $X_i \subseteq X$ denotes the set of instances with class label $y_i$, $n_i = |X_i|$, and these class labels are assumed to have the order $y_1 < y_2 < \dots < y_m$. The training data consists of a set of labeled instances, just like in classification. Since CLeRa is learning from pairwise comparisons of the form $(o', o)$, it first extracts such comparisons from the original data by looking at the class information: A preference $(o', o)$ is generated for each pair $(o', y_j)$ and $(o, y_i)$ of labeled instances in the (original) training data such that $y_i < y_j$.

The ranking performance of CLeRa (with maximum grouping size $g_{\max} = 2$) is compared with LexRank, which was implemented as proposed by Flach and Matsubara [12, 11]; therefore, this method was only applied to binary (two-class) problems but not to problems with more than two classes. (The red wine data actually has a target attribute with values between 1 and 10; it was binarized by thresholding at the median.) We applied naive Bayes (NB) and decision tree (J48) learning as additional baselines, using the standard implementations in Weka (trees are not pruned) and sorting instances according to the estimated probability of the positive class; note that these methods are not applicable to the multi-class case either.

[Table 1. Average performance in terms of C-index, based on a 10-fold cross-validation (columns: CLeRa, LexRank, J48, NB; the numeric entries are not recoverable here). Datasets: Red Wine, Census Income, Credit Approval, Mammographic Mass, Mushroom, SPECT Heart, Ionosphere, MAGIC Gamma Telescope, Breast Cancer Wisconsin, German Credit, Car Evaluation, Nursery, Tic-Tac-Toe Endgame, Vehicle, Cardiotocographic; LexRank, J48 and NB are marked n/a on the last five datasets.]

The results of a 10-fold cross-validation are given in Table 1. Since CLeRa produced a completeness of 1, or extremely close to 1, throughout, these values are not reported here. Overall, the performance of the methods is quite comparable. In particular, CLeRa and LexRank produce quite similar results on many data sets. In some cases, however, the results are strongly in favor of CLeRa (Census Income, Ionosphere, MAGIC Gamma Telescope, German Credit). Probably, this is because the bias imposed by the assumption of a standard lexicographic order is inadequate for these data sets, and hence our extensions (conditional attribute importance, conditional value preferences, variable grouping) clearly pay off.
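For reference, the C-index can be computed directly from predicted scores and ordinal labels. A minimal numpy-based sketch with ties counted as 1/2, as in the formula above (the function name is ours):

```python
import numpy as np

def c_index(scores, labels):
    """C-index: the AUC generalized to ordered classes."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    classes = np.unique(labels)           # sorted ascending: y_1 < ... < y_m
    num, denom = 0.0, 0
    for i, yi in enumerate(classes):
        for yj in classes[i + 1:]:
            lo, hi = scores[labels == yi], scores[labels == yj]
            # pairs where the higher class is ranked higher; ties count 1/2
            diff = hi[:, None] - lo[None, :]
            num += (diff > 0).sum() + 0.5 * (diff == 0).sum()
            denom += diff.size
    return num / denom if denom else 0.0
```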
In a direct comparison with an existing lexicographic ranker, the benefit of our extensions are becoming quite obvious. Important topics of future work can be found both on the theoretical and practical side. In particular, we are currently studying formal properties of our generalized model class, such as its expressiveness and means for regularization and complexity control. Practically, there is certainly scope for improving our current algorithm, for example by devising a suitable procedure for estimating an optimal value g max for the oder of variable grouping. Moreover, improving the computational efficiency of CLeRa would be desirable, too. Last but not least, we are of course interested in real applications for which (generalized) lexicographic models appear to be an adequate representation. REFERENCES [1] M. Ahlert, Aggregation of lexicographic orderings, Homo Oeconomicus, 25(3/4), , (2008). [2] A. Asuncion and D.J. Newman. UCI machine learning repository, [3] R. Booth, Y. Chevaleyre, J. Lang, J. Mengin, and C. Sombattheera, Learning various classes of models of lexicographic orderings, in Workshop on Preferencce Learning at ECML-2009, (2009). [4] R. Booth, Y. Chevaleyre, J. Lang, J. Mengin, and C. Sombattheera, Learning conditionally lexicographic preference relations, in Proc. ECAI 2010, pp , Amsterdam, The Netherlands, (2010). IOS Press. [5] C. Boutilier, R.I. Brafman, C. Domshlak, H.H. Hoos, and D. Poole, CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements, Journal of Artificial Intelligence, 21, , (2004). [6] W. Cheng, M. Rademaker, B. De Baets, and E. Hüllermeier, Predicting partial orders: Ranking with abstention, in Proc. ECML-2010, (2010). [7] William Cohen, Robert Schapire, and Yoram Singer, Learning to order things, Journal of Artificial Intelligence Research, 10, , (1999). [8] O. Dekel, C. Manning, and Y. Singer, Log-linear models for label ranking, in Advances in Neural Information Processing Systems, (2003). [9] J. Dombi, C. Imreh, and N. Vincze, Learning lexicographic orders, European Journal of Operational Research, 183(2), , (2007). [10] P.C. Fishburn, Lexicographic orders, utilities and decision rules: A survey, Management Science, 20(11), , (July 1974). [11] P.A Flach and E.T. Matsubara, On classification, ranking and probability estimation, in Probabilistic, Logical and Relational Learning - A Further Synthesis, (2007). [12] P.A. Flach and E.T. Matsubara, A simple lexicographic ranker and probability estimator, in Proceedings of the 18th European conference on Machine Learning, ECML 07, pp , Berlin, Heidelberg, (2007). Springer-Verlag. [13] J. Fürnkranz and E. Hüllermeier, Preference Learning, Springer-Verlag, Berlin, Heidelberg, [14] J. Fürnkranz, E. Hüllermeier, and S. Vanderlooy, Binary decomposition methods for multipartite ranking, in Proceedings ECML/PKDD 2009, European Conference on Machine Learning and Knowledge Discovery in Databases, Bled, Slovenia, (2009). [15] E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker, Label ranking by learning pairwise preferences, Artificial Intelligence, 172, , (2008). ECAI-12 Workshop on Preference Learning: Problems and Applications in AI 14

[16] M. Schmitt and L. Martignon, On the complexity of learning lexicographic strategies, Journal of Machine Learning Research, 7, 55-83, (December 2006).
[17] N. Wilson, Efficient inference for expressive comparative preference languages, in Proceedings of IJCAI-09, (2009).
[18] F. Yaman, T.J. Walsh, M.L. Littman, and M. desJardins, Democratic approximation of lexicographic preference models, in Proceedings of ICML-08, Helsinki, Finland, (2008).

An apple-to-apple comparison of Learning-to-rank algorithms in terms of Normalized Discounted Cumulative Gain

Róbert Busa-Fekete 1,2 and György Szarvas 3 and Tamás Éltető 4 and Balázs Kégl 5

1 Department of Mathematics and Computer Science, Marburg University, Germany, busarobi@mathematik.uni-marburg.de
2 R. Busa-Fekete is on leave from the Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged.
3 Computer Science Department, Technische Universität Darmstadt, Germany, szarvas@tk.informatik.tu-darmstadt.de
4 Ericsson Hungary, Könyves Kálmán krt. 11.B, 1097 Budapest, Hungary, tamas.elteto@ericsson.com
5 Linear Accelerator Laboratory (LAL) and Computer Science Laboratory (LRI), University of Paris-Sud, CNRS, Orsay, 91898, France, balazs.kegl@gmail.com

Abstract. The Normalized Discounted Cumulative Gain (NDCG) is a widely used evaluation metric for learning-to-rank (LTR) systems. NDCG is designed for ranking tasks with more than one relevance level. There are many freely available, open source tools for computing the NDCG score for a ranked result list. Even though the definition of NDCG is unambiguous, the various tools can produce different scores for ranked lists with certain properties, deteriorating the empirical tests in many published papers and making empirical results published in different studies difficult to compare. In this study, first, we identify the major differences between the various publicly available NDCG evaluation tools. Second, based on a set of comparative experiments using a common benchmark dataset in LTR research and 6 different LTR algorithms, we demonstrate how these differences affect the overall performance of different algorithms and the final scores that are used to compare different systems.

1 Introduction

In subset ranking [3] (or web page ranking), the goal is to learn a ranking function that approximates the ideal partial ordering of a set of objects (or documents retrieved for the same query). The partial ordering is provided by relevance labels representing the relevance of documents with respect to the query on an absolute scale. In the past, manually designed ranking functions, such as BM25 [7], were used to rank the retrieved documents in web page ranking. More recently, this problem has been tackled as a machine learning task, where the training data is given in the form of (query, document, relevance label) triplets. These machine learning based ranking approaches are referred to as learning-to-rank (LTR) systems.

The Normalized Discounted Cumulative Gain (NDCG) is a widely used evaluation metric for learning-to-rank systems. NDCG is designed for ranking tasks with more than one relevance level. There are many freely available, open source tools for computing the NDCG score for a ranked result list (for a full list of these tools, see Section 3). Even though the definition of NDCG is unambiguous (see, for example, edu/ir-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html), the various tools can produce different scores for ranked lists with certain properties, deteriorating the empirical tests in many published papers and making empirical results published in different studies difficult to compare. We found that for certain benchmark datasets, the relative order of the performance of different LTR methods can change depending on which evaluation tool was used. The reason for this can be two-fold.
First, the implemented NDCG calculation can itself result in different scores, and in some cases a different order, for the same ranked result lists. Second, and more importantly, we found some of the LTR algorithms that compute the NDCG scores during the training phase to be sensitive to the different ways of computing the NDCG score; i.e., depending on which NDCG implementation is used, some LTR methods can produce different models that provide significantly different overall performance. We recognized this in previous work [2]; here we investigate the effect of using different evaluation tools in more detail.

The contribution of this study is two-fold. First, we identify the major differences between the various publicly available NDCG evaluation tools. Second, based on a set of comparative experiments using a common benchmark dataset in LTR research and 6 different LTR algorithms, we demonstrate how these differences affect the overall performance of different learning methods and the final scores that are used to compare different systems. We analyze these differences and also draw conclusions on which metric should be used and why. This is not only an important step towards making the empirical research results reported in the literature uniformly comparable, but it also has implications on how benchmark datasets should be designed in order to be equally suited to train and evaluate any particular learning-to-rank algorithm.

The rest of this paper is organized as follows. In the next section we describe the formal setup. Then, in Section 3, we list all the evaluation tools that are freely available to compute NDCG and identify precisely the differences in the way they compute the NDCG score. Next, in Section 4, we briefly overview the LTR algorithms used for comparison in the experiments of Section 5. Finally, based on the observed differences in experimental results, we draw our conclusions in Section 6.

2 Formal LTR task

In this section we formally define the learning-to-rank problem and introduce the notation that will be used in the rest of the paper. Let us assume that we are given a set of query objects $D = \{D^{(1)}, \dots, D^{(M)}\}$. Each query object $D^{(k)}$ consists of a set of $n^{(k)}$ pairs
$$D^{(k)} = \Big\{ \big(x^{(k)}_1, \ell^{(k)}_1\big), \dots, \big(x^{(k)}_{n^{(k)}}, \ell^{(k)}_{n^{(k)}}\big) \Big\}.$$
The real-valued feature vectors $x^{(k)}_i$ represent the $k$th query and the $i$th document received as a potential hit for the query. (When it is not confusing, we will omit the query index and simply write $x_i$ for the $i$th document of a given query.) The label index $\ell^{(k)}_i$ of the query-document pair $x^{(k)}_i$ is an integer between 1 and $K$. We assume that we are given a set of numerical relevance grades $Z = \{z_1, \dots, z_K\}$. The relevance grade $z^{(k)}_i = z_{\ell^{(k)}_i}$ expresses the relevance of the $i$th document to the $k$th query on a numerical scale. A popular choice for the numerical relevance grades is
$$z_\ell = 2^{\ell - 1} - 1 \qquad (1)$$
for all $\ell = 1, \dots, K$.

The goal of the ranker is to output a permutation $j^{(k)} = (j_1, \dots, j_{n^{(k)}})$ over the integers $(1, \dots, n^{(k)})$ for each query object $D^{(k)}$. A widely used empirical measure of the quality of the permutation $j^{(k)}$ is the Discounted Cumulative Gain (DCG)
$$DCG\big(j^{(k)}, D^{(k)}\big) = \sum_{i=1}^{n^{(k)}} c_i\, z^{(k)}_{j_i}, \qquad (2)$$
where $c_i$ is the discount factor of the $i$th document in the permutation. The most common discount factor is
$$c_i = \frac{1}{\log(1 + i)}. \qquad (3)$$
The rationale of this formula is that a user will be particularly satisfied if he/she finds relevant documents early in the permutation. To normalize DCG between 0 and 1, (2) is usually divided by the DCG score of the best permutation (yielding the NDCG), which can be computed as
$$IDCG\big(D^{(k)}\big) = \max_j DCG\big(j, D^{(k)}\big).$$
This score is referred to as the ideal DCG score. It is also a common practice to truncate the sum (2) at $n_{\max}$, defining the $DCG_{n_{\max}}$ and $NDCG_{n_{\max}}$ scores. The reason for this is that a user would rarely browse beyond the first page of search results containing the first $n_{\max}$ hits.
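Definitions (2)-(3) and the normalized variant are easily stated in code. The following minimal sketch follows the definition above; the treatment of queries with zero ideal DCG is made an explicit parameter, since this is exactly where the tools discussed next diverge (the function names and the use of the natural logarithm in (3) are our assumptions):

```python
import math

def dcg(relevances, n_max=None):
    """DCG of a ranked list of numeric relevance grades (best hit first),
    with the discount c_i = 1 / log(1 + i) of Eq. (3)."""
    if n_max is not None:
        relevances = relevances[:n_max]
    return sum(z / math.log(1 + i) for i, z in enumerate(relevances, start=1))

def ndcg(relevances, n_max=None, empty_query_score=0.0):
    """NDCG = DCG / IDCG; `empty_query_score` is returned when the ideal
    DCG is zero, i.e. when the query has no relevant document."""
    ideal = dcg(sorted(relevances, reverse=True), n_max)
    if ideal == 0.0:
        return empty_query_score
    return dcg(relevances, n_max) / ideal
```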
3 Evaluation tools

In this section we briefly describe and compare the various tools available to compute NDCG scores. The definition (2) is unambiguous; nevertheless, the tools can differ in the definition of the discount factor $c_i$ (3). More importantly, there can be an important difference in the way the DCG score is normalized when i) there is no relevant document for a query ($z_i = 0$ for all $i$), or ii) the number of documents is less than the truncation level $n_{\max}$. Although this seems to be a technical subtlety, it turns out that the confusion between the different tools can significantly alter the numerical scores and in some cases can even change the relative ordering of the algorithms on benchmark datasets. We compared six evaluation tools computing the NDCG scores:

1. The LETOR 3.0 script, implemented in Perl (projects/letor/letor3.0/evaluationtool.zip)
2. The LETOR 4.0 script, implemented in Perl (projects/letor/letor4.0/evaluation/eval-score-4.0.pl.txt)
3. The MS script, implemented in Perl (mslr/eval-score-mslr.pl.txt)
4. The YAHOO script, implemented in Python (evaluate.py.txt)
5. The RANKLIB package, implemented in Java (vdang/ranklib.html)
6. The TREC evaluation tool v8.1, implemented in C (eval/)

The evaluation tools can be divided into three groups. The tools of the first group compute $DCG_{n_{\max}}$ according to the definition (2) described in Section 2. The LETOR 3.0, RANKLIB, and TREC tools belong to this group. All of these tools assign a zero score to a query if it is empty, that is, if $z_i = 0$ for all $i$, which means that there are no relevant documents. The TREC tool uses the labels of documents given in the input file as relevance grades by default. From this point of view, this is the most flexible implementation, since arbitrary relevance grades can be defined. For example, in the case of the MQ2008 dataset, the labels 0, 1 and 2 should simply be replaced by 0, 1 and 3, respectively, to obtain the commonly used exponential grades (1).

The second group is composed of the YAHOO tool alone. It also computes $DCG_{n_{\max}}$ according to the definition (2), but it assigns 1.0 to the empty queries. This is a minor difference that generates an additive bias between the $NDCG_{n_{\max}}$ computed by the YAHOO tool and the three tools of the first group.

The third group consists of the LETOR 4.0 and MS tools. Except for a small technical difference (the LETOR 4.0 tool can be used for up to three relevance labels, whereas the MS tool can handle up to five relevance labels), they compute the same score. Like the RANKLIB and LETOR 3.0 tools, they assign zero to a query where the ideal $DCG_{n_{\max}}$ is zero. Their rather strange feature is that they also assign a zero $DCG_{n_{\max}}$ score to a query with fewer than $n_{\max}$ documents in it, even if these documents are highly relevant. So, formally, they compute the $DCG_{n_{\max}}$ score as
$$DCG_{n_{\max}}\big(j^{(k)}, D^{(k)}\big) = \begin{cases} \sum_{i=1}^{n_{\max}} c_i\, z^{(k)}_{j_i} & \text{if } n_{\max} \le n^{(k)}, \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$
This truncation does not only distort the test score, but it can also alter the training of algorithms that depend directly on the NDCG score. Indeed, for example, in ADARANK [9], which optimizes the $NDCG_{10}$ evaluation metric, a query containing less than 10 documents does not influence the computation of the coefficient of the weak ranker at all, and the weight of such queries converges to zero over the boosting iterations.
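For contrast with the definition-conforming `ndcg` sketch above, the LETOR 4.0/MS behaviour of Eq. (4) amounts to a single extra branch. This is our reading of Eq. (4), not the tools' actual source code:

```python
def dcg_letor(relevances, n_max):
    """Eq. (4): a query with fewer than n_max documents gets a zero score.
    Reuses dcg() from the sketch above."""
    if len(relevances) < n_max:
        return 0.0
    return dcg(relevances, n_max)
```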

The only remaining source of difference between the tools is the way they sort the objects of interest based on the scores. All tools except the TREC tool make use of the default built-in, programming-language-dependent sorting function, which typically implements the quicksort algorithm. Therefore, the order of two elements with the same score depends on the implementation of the sorting algorithm used. The TREC tool, in contrast, uses the lexicographic ordering based on the document ID given in the input file if the system-predicted scores are equal for two documents. We believe that neither of these two ways of handling equal scores is desirable. On the one hand, the sorting algorithm should not have an effect on the evaluation itself. On the other hand, the document ID normally does not bear any useful information regarding the content of a document, so in this sense, the use of the ordering based on document ID corresponds to a particular random ordering, which is also not a desired feature in an evaluation tool.

4 Learning-to-rank algorithms

In our comparison study, we used six state-of-the-art ranking methods. Here we briefly summarize them.

1. ADARANK [9] is a listwise boosting approach aiming to optimize an arbitrary listwise IR metric, such as the Mean Average Precision (MAP), ERR, or NDCG. Inspired by ADABOOST, it uses a stepwise greedy optimization technique for maximizing the chosen IR metric. In every boosting iteration, ADARANK reweights the queries based on the scores obtained by the evaluation metric: it up-weights queries having lower scores and down-weights high-scoring queries. The weak learner is chosen by optimizing the listwise evaluation metric of interest, which is usually hard to optimize except for very simple weak classifiers. This can be viewed as a handicap of this method. Following the original implementation of ADARANK, we used the best feature ranker (BF) described above as base ranker, taking into account the weighting of queries. The only hyperparameter of ADARANK is the number of boosting iterations, which we optimized by using early stopping on the validation set. We refer to this method as ADARANK.{NDCG}.

2. RANKNET [1] is a neural-net-based method which employs a loss based on pairwise cross-entropy as its objective function. The neural net with one output node is trained to directly optimize the differentiable probabilistic pairwise loss instead of the common squared loss (see the sketch after this list). We validated the number of hidden layers ranging from 1 to 3 and the number of neurons in the hidden layers ranging from 10 to 500. For the number of training epochs we applied early stopping.

3. RANKBOOST [4] is a pairwise boosting approach. The objective function is the rank loss (as opposed to ADABOOST, which optimizes the exponential loss). In each boosting iteration the weak classifier is chosen by maximizing the weighted rank loss. For the weak learner we used decision stumps and a variant of the single decision stump described in [4] which is able to optimize the rank loss in an efficient way.

4. RANKSVM [5] is a pairwise method based on SVM, formulating the ranking task as binary classification. We used a linear kernel because the optimization using non-linear kernels cannot be carried out in reasonable time. The tolerance level of the optimization was fixed, and the regularization parameter was validated in the interval $[10^{-6}, 10^{4}]$ with a logarithmically increasing step.

5. COORDINATEASCENT (CA) [6] is a linear listwise model where the scores of the query-document pairs are calculated as weighted combinations of the feature values. The weights are tuned by using a coordinate ascent optimization method where the objective function is an arbitrary evaluation metric given by the user. The coordinate ascent optimization method itself has two hyperparameters to be tuned: the number of restarts $R$ from random initial weights, and the number of iterations $T$ taken after each restart. We used $R = 30$ and $T = 100$. We did not validate these hyperparameters, but using the validation set we evaluated every model obtained by restarting the optimization process, and we kept the one having the highest performance.

6. LAMBDAMART [8] is a boosted regression tree model. Since it handles the LTR problem as a regression task, it could be classified as a pointwise method, but during the training phase it adjusts the parameters of the regression trees based on a derivative estimate of NDCG, and it is therefore considered a listwise approach. We validated the number of boosting iterations. The number of leaves was set to 10, with a fixed learning rate.
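As an illustration of the pairwise probabilistic loss mentioned for RANKNET (item 2), here is a minimal sketch of the cross-entropy loss for one pair where document $i$ is known to be preferred to document $j$; the omission of the tunable sigmoid scale is a simplifying assumption:

```python
import math

def ranknet_pairwise_loss(s_i, s_j):
    """RankNet-style pairwise cross-entropy for a pair with i preferred
    to j, given the model scores s_i and s_j (scale constant omitted)."""
    p_ij = 1.0 / (1.0 + math.exp(-(s_i - s_j)))  # modeled P(i ranked above j)
    return -math.log(p_ij)
```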
5 Experiments

We identified two major differences in the way the DCG score is normalized in publicly available NDCG evaluation scripts:

1. When there is no relevant document for a query ($z_i = 0$ for all $i$), evaluation scripts conforming to the definition assign a 0.0 NDCG value to the query. Optionally, some scripts may assign a different value as default for such queries (namely, 1.0 for the YAHOO metric).

2. When the number of documents is less than the truncation level $n_{\max}$, evaluation scripts conforming to the definition compute an $NDCG_{n^{(k)}}$ value, truncated at the number of documents $n^{(k)}$ available for that query (thereby assuming that it is always possible to fill in a result list with irrelevant documents, and at the same time assuming that the provided relevance labeling is exhaustive, i.e., there are no further, unseen relevant documents, and thus the use of the $IDCG_{n^{(k)}}$ score in the normalization step is a realistic estimate; note, however, that relevance labels are often pooled in IR datasets and there is no guarantee that no relevant but unlabeled documents exist). Optionally, some scripts may assign a different value as default for such queries (namely, 0.0 for the LETOR metric).

We consider the difference stemming from the different strategies to order results with equal predicted scores less crucial (in practice two different documents seldom get the same predicted score, i.e., ties are rare), so we do not assess its impact, and we use all metrics with exponential relevance grades (the TREC tool can be parameterized to use exponential gain, while the other tools are implemented to use it). That is, in our experiments we consider the RANKLIB and TREC evaluation tools to be equivalent and compare their scores to those provided by the YAHOO and LETOR metrics, which are representative examples of the two major differences we found between the various tools.

In Figure 1, we plot the NDCG@1 to NDCG@10 scores for 4 characteristic learning-to-rank algorithms (ADARANK, LAMBDAMART, RANKNET and CA) using the three different evaluation formulas to evaluate the models: in each row, the left, middle and right plots show the values provided by the RANKLIB/TREC, the YAHOO and the LETOR tools, respectively. All models were trained to optimize the $NDCG_{10}$ scores, and the plots show the NDCG scores of these models for cut-off values from 1 to 10. Each plot shows 3 different curves that correspond to models trained using a specific metric during the training of the model: the blue, black and red lines indicate models learned using the YAHOO, RANKLIB/TREC and LETOR metrics, respectively. This way we can visually compare the effects of the different evaluation tools both on the numeric outcomes (comparing the curves with the same color, denoting the same models, across the 3 plots horizontally) and on the learnt model (contrasting the 3 curves on the same plot, denoting models learnt using different measures).
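The two normalization differences can be made tangible on a toy query, reusing the `ndcg` and `dcg_letor` sketches from Section 3 (the relevance grades below are invented for illustration):

```python
# a query with one relevant document among three retrieved documents,
# listed in the predicted order, evaluated with cut-off n_max = 10
rels = [3, 0, 0]
print(ndcg(rels, n_max=10))                               # by definition: 1.0
print(dcg_letor(rels, n_max=10))                          # LETOR-style: 0.0 (n < n_max)
print(ndcg([0, 0, 0], n_max=10))                          # no relevant docs: 0.0
print(ndcg([0, 0, 0], n_max=10, empty_query_score=1.0))   # YAHOO-style: 1.0
```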

[Figure 1. The dependence of NDCG@10 scores on the NDCG evaluation method used, on the MQ2008 dataset. Panels: (a) ADARANK/YAHOO, (b) ADARANK/TREC, (c) ADARANK/LETOR, (d) LAMBDAMART/YAHOO, (e) LAMBDAMART/TREC, (f) LAMBDAMART/LETOR, (g) RANKNET/YAHOO, (h) RANKNET/TREC, (i) RANKNET/LETOR, (j) CA/YAHOO, (k) CA/TREC, (l) CA/LETOR. Each panel plots NDCG@k against position k for the RANKLIB, LETOR and YAHOO curves.]

[Figure 2. The dependence of NDCG@10 scores on the NDCG evaluation method used, on the MQ2008 dataset. Panels: (a) YAHOO, (b) TREC, (c) LETOR; each plots NDCG@k against position k for ADARANK.NDCG, LAMBDAMART, RANKBOOST, RANKNET, RANKSVM and CA.]

We can make several observations based on Figure 1. First, for all the algorithms in Figure 1, the NDCG scores provided by the YAHOO tool are on average around 0.25 higher than the scores provided by the RANKLIB/TREC tool, which conforms to the definition of NDCG. This difference comes solely from the presence of queries with 0 relevant documents in the dataset: for those queries, the YAHOO tool assigns 1.0 as NDCG, which results in an additive bias compared to the official NDCG scores. On the other hand, the scores provided by the LETOR tool degrade rapidly for cut-off values of 8-10 for all algorithms, with a difference around or above 0.25 compared to the official NDCG scores. As opposed to the former additive difference, this behavior might result in a different ordering of the algorithms: a method that is on average superior to others on larger sets but weaker than the competitors on small sets is preferred by the LETOR metric, as this tool assigns zero to small sets (smaller than the NDCG cut-off).

Second, and more interestingly, we can observe that it is by no means irrelevant which metric was used during the training of the algorithms: for those algorithms that make use of the NDCG scores in some way in the training process (namely, the ADARANK, LAMBDAMART and CA methods), we see differences in the performance of the learnt models. We observe that in general the LETOR tool is not suited for training the algorithms: its property of assigning a 0.0 score to small sets makes these small sets useless for training (to optimize $NDCG_{10}$), i.e., the algorithms can exploit less data in a meaningful way to learn patterns. This results in a significantly worse performance for these algorithms when trained using the LETOR metric instead of the official NDCG scores. (The error bars on the plots correspond to the standard errors of query-wise NDCG scores averaged out in quadrature over the folds.) On the other hand, for some algorithms it is worth assigning a non-zero score to sets which contain no relevant documents at all: doing so, the algorithms do not increase the weight of such queries, unlike for other low-performing queries where there is hope to improve performance. This can result in better learning rates and in some cases slightly better performance (see, e.g., the blue vs. black curves for the ADARANK and LAMBDAMART methods).

Overall, we can observe that the best metric to train the algorithms is the one provided by YAHOO: in general this results in equal or better learnt models than the other evaluation tools. To shed light on how these characteristics of the evaluation tools affect the relative performance of benchmark LTR algorithms, in Figure 2 we plot the NDCG scores for 6 different learning algorithms (trained using the official NDCG metric) with all the evaluation tools surveyed in this study. As can be seen, the performance of the best algorithms is very close to each other, with Coordinate Ascent having a slight advantage regardless of which metric is used for evaluation.
Slightly worse than CA, the three algorithms RANKBOOST, RANKSVM and LAMBDAMART perform very close to each other. If we compare the results obtained with different evaluation measures, the relative order of these (otherwise very similar) models changes depending on the metric: apparently the RANKBOOST algorithm has a slight advantage over the other two for short sets, an advantage that disappears when the LETOR metric is used for evaluation. In general, we were unable to reproduce the competitive results of ADARANK [9] using the reimplementation of the algorithm in the RANKLIB package. (We found no clear indication in the original article of what stopping criterion was used to terminate the iteration in the training process, so we tried various stopping strategies and provide the best results we obtained; these are nevertheless lower than those reported at the LETOR website.)

6 Conclusion

In this study, we reviewed the publicly available NDCG evaluation scripts, identified and compared the differences between them, and systematically analyzed how these differences affect the numeric results and the training processes of various learning-to-rank algorithms. It is reasonable to assume that most previous studies use one of the assessed evaluation tools, and at the same time it seems likely that most measures are used at least in some studies to evaluate the performance of machine-learnt ranking systems. We found that there indeed are differences between tools despite the relative simplicity of the popular NDCG evaluation metric, and our experiments demonstrate that these differences can easily lead to non-trivial differences between research results. Since most studies do not discuss such details as the evaluation script used, this fact makes previous studies very difficult to compare and might lead to the misinterpretation of results and false conclusions.

We identified that the two key points of difference between the tools are i) the way queries with fewer than $n_{\max}$ documents are handled (in the case of $NDCG_{n_{\max}}$), and ii) the way queries with no relevant document (i.e., when all documents have the same, zero relevance score and therefore the ideal DCG is zero) are handled. These two factors can lead to different overall scores (for the same model) and also to inherently different learnt models, depending on which learning algorithm was used. Some divergences from the definition of the NDCG measure might be

justified from an ML perspective, though: for example, if the measure assigns a perfect score to queries with zero relevant documents, these queries are not treated like other queries that can be improved, and do not distort the model. On the other hand, we could not identify any benefit of other modifications, such as zeroing out the $NDCG_{n_{\max}}$ score for queries with fewer than $n_{\max}$ documents in total.

To summarize our findings, we suggest the following protocols to make learning-to-rank results more comparable and benchmark datasets better suited to training and evaluating systems:

1. The use of tools that give a zero score to small queries should be avoided, as these can negatively impact certain algorithms (while others are unaffected), and thus the reported results can represent a false relative order of the algorithms.

2. If possible, benchmark datasets should be free of non-meaningful queries. Queries that do not have any relevant documents play no role in learning to rank: they cannot be used to learn meaningful patterns and they do not have an impact on evaluation. At the same time, such queries can negatively impact the performance of some algorithms, and can motivate non-standard evaluation tools (cf. the YAHOO metric).

3. Regardless of which evaluation script is used while training the systems (cf. the YAHOO tool, which is reasonable in the presence of queries without relevant documents in the data), the final evaluation should be carried out using a tool fully conforming to the official NDCG definition. This way, machine-learnt performance scores would become directly comparable to non-machine-learnt ones, such as those coming from TREC and other evaluation exercises.

REFERENCES

[1] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, Learning to rank using gradient descent, in Proceedings of the 22nd International Conference on Machine Learning, (2005).
[2] R. Busa-Fekete, B. Kégl, T. Éltető, and Gy. Szarvas, Tune and mix: learning to rank using ensembles of calibrated multi-class classifiers, Machine Learning, accepted, (2012).
[3] D. Cossock and T. Zhang, Statistical analysis of Bayes optimal subset ranking, IEEE Transactions on Information Theory, 54(11), (2008).
[4] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, An efficient boosting algorithm for combining preferences, Journal of Machine Learning Research, 4, (2003).
[5] R. Herbrich, T. Graepel, and K. Obermayer, Large margin rank boundaries for ordinal regression, in Advances in Large Margin Classifiers, eds. Smola, Bartlett, Schoelkopf, and Schuurmans, MIT Press, Cambridge, MA, (2000).
[6] D. Metzler and B. W. Croft, Linear feature-based models for information retrieval, Information Retrieval, 10, (2007).
[7] S. Robertson and H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval, 3, (2009).
[8] Q. Wu, C.J.C. Burges, K.M. Svore, and J. Gao, Adapting boosting for information retrieval measures, Information Retrieval, 13(3), (2010).
[9] J. Xu and H. Li, AdaRank: a boosting algorithm for information retrieval, in SIGIR'07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, (2007).

Alleviating cold-user start problem with users social network data in recommendation systems

Eduardo Castillejo and Aitor Almeida and Diego López-de-Ipiña 1

1 Deusto Institute of Technology - DeustoTech, University of Deusto, Avda. Universidades 24, Bilbao, Spain. {eduardo.castillejo, aitor.almeida, dipina}@deusto.es

Abstract. The Internet and the Web 2.0 have radically changed the way of purchasing items, provoking the fall of geographic selling barriers all over the world. So large is the amount of data and items we can find on the Web that it has turned out to be almost unmanageable. Due to this situation, many algorithms have emerged trying to filter items for e-commerce users based on their tastes. In order to do this, these systems need information about the tastes of the users as input. This limitation is reduced as the users' interaction with these systems increases. The main problem arises when new users enter a recommendation platform for the first time. The so-called cold-start problem causes unsatisfactory random recommendations, which goes against these systems' purpose. Cold-start covers new users, new items, and even entirely new systems. This situation calls for new ways of obtaining user data. Social networks can be seen as huge sources of information, and social network analysis helps us exploit them using different techniques. In this paper, we present a solution which uses social network user data to generate first recommendations, alleviating the cold-user limitation. Besides, we demonstrate that it is possible to reduce the cold-user problem by applying our solution in a recommendation system environment.

1 INTRODUCTION

The amount of information in the world is increasing far more quickly than our ability to process it [17]. Common users of Web-based systems usually have to deal with such an amount of data that their interaction can become slow, ending in serious losses of user attention, which also means losses of sales and user satisfaction. For example, buying a CD or a vinyl record in a music store has been a habit since the 70s and 80s. People used to go to the store and navigate through tens or hundreds of albums, seeking those which fit their tastes. Nowadays, with all the possibilities the Internet offers to every user, this number of items has increased to millions, or even more. Therefore, taking advantage of the possibilities the Web 2.0 provides to us all, researchers started to design algorithms able to filter the information (or items) presented to the user. These algorithms started to constitute what we today know as recommender systems. These systems use the opinions of a community of users to help individuals in that community more effectively identify content of interest from a potentially overwhelming set of choices [16]. Within past years there have been many advances in this area [4]. Some systems, such as YouTube, started to store information about users' searches to infer their tastes [8, 5], managing some explicit and implicit information about users' interactions.
Without any search, rating or purchase a recommender system is unable to find any interesting items. Sometimes it even has been necessary many ratings before being able to provide a reasonable recommendation [2]. Given the problem we asked ourselves a simple question: Is there any other way to infer user tastes without their active participation in the system? We think there is, and in this work we pretend to reduce the cited problem taking into account users social interactions. Social networks strength lies in the possibility of establishing some social relationships among people, companies and other groups using the Internet as the bridge of communication. The range of accessible social networks is very diverse, from those which just take into account the user location in order to rate a place, bar or store, to those which allows users to share not only their thoughts or beliefs, even their personal pictures, videos and music. There is such amount of information of the users in these networks that new challenges arise to take advantage of it. This way, our proposal seizes the opportunity of exploiting social network data in order to reduce the cold-start problem in any recommendation system. Applying social network analysis techniques we are able to get some recommendations to the user (with certain accuracy) based on the relationships with others in social networks. The remainder of this paper is structured as follows: first, in Section 2 we analyse the current state of the art in recommendation systems and the most popular techniques to avoid the cold-user or cold-start problem. Next we present our methodology for collecting valuable data from social networks taking into account users relationships (Section 3). In Section 4 we analyse the results obtained from our proposal. Finally, in Section 5, we summarize our experiences and discuss the conclusions and future work. 2 RELATED WORK Since the mid-1990s recommender systems have become an important research area attracting the attention of e-commerce companies. Amazon [12], Netflix 3 and Yahoo! Music 4 [6] are widespread exam ECAI-12 Workshop on Preference Learning: Problems and Applications in AI 22

Although these systems have evolved, becoming more accurate, the main problem is still out there: to estimate the rating of an item which has not been seen by a user. This estimation is usually based on the rest of the items rated by the current user, or on the ratings given by others whose rating pattern is similar to the user's. Although there are different kinds of recommendation systems (content-based, collaborative filtering and hybrid techniques) [4], they all suffer from the same main limitations: sparsity and scalability [17], and cold-start problems [15]. Some authors have tried to avoid the cold-user problem by asking users a series of questions about their tastes or by proposing some selected items in order to obtain ratings [13, 7]. As we presume, these solutions can cause displeasure among users, becoming tedious and cumbersome activities. An-Te Nguyen et al. [4] have tried to reduce the cold-user problem by exploiting available data (e.g. age, occupation, location, etc.). In [18] the authors present a metric (the CROC curve) to improve the evaluation of recommender system performance. Moreover, some authors have improved their recommendation algorithms by combining them with users' social data from social networks [10, 9]. Although they do not tackle the cold-start problem, their idea of using the available social information of users as an input represents a new starting point for these systems (e.g. Foursquare⁵ adds information about user geolocation). There is also open research at Carnegie Mellon University's School of Computer Science on how people really inhabit their cities based on Foursquare data. The project groups check-ins by physical proximity and measures social proximity by how often different people check in at similar places; the resulting areas are dubbed accordingly [3]. But in a social network there is more beyond users, relationships and data. Social network analysis (SNA) refers to methods used to analyse social networks: social structures made up of individuals, called nodes, which are connected by different representations of relationships (e.g. friendship, kinship, financial exchange, etc.). Figure 1 represents an example of a social network graph in which different nodes are connected to each other by relation lines called links. Once we have empirical data on a social network, new questions arise: Which nodes are the most central members? Which are the most peripheral? Which people are influenced by others? Which connections are most crucial? These questions and their answers represent the basic domain of SNA [14]. There are many metrics which measure different aspects of a social network, taking into account the nodes and their edges. Table 1 shows some of the best known metrics in SNA. As described in Section 3, we have chosen the eigenvector centrality metric to tackle the cold-start problem. A variant of eigenvector centrality is used by the Google search engine to rank Web pages [11], but it is based on the premise that the system already has data from the user to work with.

Table 1: Some of the best known metrics in social network analysis.
Betweenness: It takes into account the connectivity of a node's neighbours to reflect the number of people whom a person connects indirectly through their direct links.
Centrality: It gives a rough indication of the social power of a node based on how well it connects the network.
Closeness: It reflects the ability to access information through the "grapevine" of network members.
Cohesion: It measures the degree to which actors are connected directly to each other by cohesive bonds.
Degree: The count of the number of ties to other actors in the network.
Eigenvector centrality: A measure of the importance of a node in a network.

Figure 1: An example of a collaborative social network. Squares represent nodes (people) and the edges represent social ties between them. Yellow squares represent important nodes which relate different sub-graphs [14].

3 PROPOSED SOLUTION

This section details the developed system, which enables the generation of generic recommendations for a user based on the rest of the users who have checked in at the same venue with Foursquare. To get the rest of the users (the network nodes) who checked in at the same place we have used the Foursquare API. Once we have the nodes, we calculate which are the most important in the network at the current venue, and then we obtain the recommendations that best fit the user through probabilistic linking.

3.1 Foursquare API

Foursquare is a location-based social networking website which allows users to check in at venues using their smartphones. The Foursquare API⁶ gives access to all of the data used by the Foursquare mobile applications. Thanks to it, developers can request user data (e.g. location, friends, last check-ins, etc.). We have developed an Android mobile application which allows users to check in at desired venues as they would do with the official Foursquare application or website (see Figure 3). Once the user checks in at any venue, our algorithm is launched. First, we must authenticate the user with the OAuth protocol (required by the Foursquare API). Then we are able to get the nodes who have checked in at the current venue.
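As a rough illustration of this retrieval step, the query might look as follows. This is only a sketch, not the production Android code (which is written in Java); the endpoint path, the versioning parameter and the response layout are assumptions based on the public Foursquare v2 API of that period.

```python
import requests

def get_herenow_checkins(venue_id, oauth_token):
    """Fetch the users currently checked in at a venue (sketch)."""
    # Endpoint path and parameters are assumptions based on the
    # Foursquare v2 API; 'v' is the API versioning date parameter.
    url = "https://api.foursquare.com/v2/venues/%s/herenow" % venue_id
    resp = requests.get(url, params={"oauth_token": oauth_token, "v": "20120801"})
    resp.raise_for_status()
    here_now = resp.json()["response"]["hereNow"]
    # 'items' holds Checkin objects; each carries the user who checked
    # in and a 'createdAt' Unix timestamp, used for the weighting below.
    return here_now["count"], here_now["items"]
```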

To do this, it is necessary to use the herenow⁷ API endpoint (as sketched above), which returns a count and a list of items (where items are Checkin⁸ responses) in JSON format. For example, doing a check-in at the Colosseo, in Rome, we obtained a JSON herenow response containing 4 items, representing 4 different people who had previously checked in at the same venue (see Figure 2). That is the point of the herenow endpoint: to get the previous check-ins done by other users at the currently checked-in venue. Therefore, we can build a 5×5 square adjacency matrix (3) representing the network graph at the current venue for each user (the first column relates the current user with the others). Each A_ij element is initially tied to the others with a default weight value of 1. Then we inspect the createdAt field of every Checkin object in the herenow JSON response. This field is a long number whose value is based on the Unix epoch time (it stores the number of seconds that have elapsed since midnight Coordinated Universal Time (UTC), January 1, 1970). We have split the importance of every check-in into 3 time intervals, adding weights to the values in the matrix at the corresponding A_ij position. We decided this because a Foursquare check-in has a lifetime of approximately 3 hours, which is why we have chosen a 3-hour window to weight the check-ins. We also believe that people are defined by their acts, and this is why we defend the temporal-closeness approach: being at the same places during similar time periods can be used as a tool for relating different people with probably similar tastes. To complete the weight assignment, we must take into account that the herenow request returns not only the list of the people who have checked in at the same venue but also a list of our Foursquare friends who have done so. Accordingly, we believe that our friends deserve an extra weight (or importance) in the network. Equations (1) and (2) detail the possible weight distribution when a user checks in at a venue. Giving priority to the user's friends, the maximum total weight for an unknown user is 4 (1 + 3), while for a friend it scales up to 6 (3 + 3). We have implemented the algorithm this way because being friends in a social network does not directly imply that users' tastes are the same.

$$\text{Unknown user} = 1 \text{ (default weight)} + \begin{cases} 3 & \text{if check-in interval} \leq 1 \text{ hour} \\ 2 & \text{if } 1 \text{ hour} < \text{check-in interval} \leq 2 \text{ hours} \\ 1 & \text{if } 2 \text{ hours} < \text{check-in interval} \leq 3 \text{ hours} \end{cases} \qquad (1)$$

$$\text{Friend user} = 3 \text{ (default weight)} + \begin{cases} 3 & \text{if check-in interval} \leq 1 \text{ hour} \\ 2 & \text{if } 1 \text{ hour} < \text{check-in interval} \leq 2 \text{ hours} \\ 1 & \text{if } 2 \text{ hours} < \text{check-in interval} \leq 3 \text{ hours} \end{cases} \qquad (2)$$

Figure 2: Example of a graph built from a user check-in where there are 4 more Foursquare users who had checked in at the same venue (check-in times: current user 1:00 pm; users 1-4 at 2:00 pm, 3:00 pm, 1:00 pm and 4:00 pm, respectively; the legend distinguishes friend from unknown users and the 1, 2 and 3 hour intervals). Matrix A' (4) represents the same graph.

The matrices below show how the default values are set when the adjacency matrix A is built, and how it is then completed with the weight assignments (A'). A' is built by adding to A the corresponding values of the relationships between the current user and the others (see Figure 2 and equations (1) and (2)). Looking at entry A_01, we can see that it has a value of 6. This is because it combines the relationship between the current user and user 1 (they are friends) with the check-in time interval (1 hour).
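A minimal sketch of this weight assignment follows; the function and variable names are ours, and the friend/unknown distinction is assumed to be available from the herenow response.

```python
import numpy as np

HOUR = 3600  # seconds

def checkin_weight(delta_seconds, is_friend):
    """Edge weight between the current user and another node,
    following equations (1) and (2)."""
    base = 3.0 if is_friend else 1.0   # default weight
    if delta_seconds <= 1 * HOUR:
        return base + 3.0
    if delta_seconds <= 2 * HOUR:
        return base + 2.0
    if delta_seconds <= 3 * HOUR:
        return base + 1.0
    return base                        # beyond the ~3 h check-in lifetime

def build_weighted_adjacency(my_timestamp, others):
    """others: list of (created_at, is_friend) pairs for the users found
    at the venue. Returns the weighted matrix A' with the current user as
    node 0; only the first row/column is re-weighted."""
    n = len(others) + 1
    A = np.ones((n, n)) - np.eye(n)    # default weight 1, zero diagonal
    for j, (ts, is_friend) in enumerate(others, start=1):
        w = checkin_weight(abs(my_timestamp - ts), is_friend)
        A[0, j] = A[j, 0] = w
    return A
```

For the example of Figure 2, a friend who checked in one hour after the current user receives weight 3 + 3 = 6, which is exactly the value of A_01 discussed above.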
Equations (3) and (4) give, respectively, the adjacency matrix A filled with the default weights and the weighted matrix A' obtained after applying equations (1) and (2); their numeric entries are not reproduced here, apart from A'_01 = 6 as discussed above.

Figure 3: A user checks in at a venue with the developed Android application. First a map is shown (a). Once the user touches the screen, a list of nearby venues is presented (b). Finally, when a venue is selected, the check-in is performed (c).

3.2 Eigenvector centrality

As depicted in Section 2, a network is a graph made up of points, nodes or vertices tied to each other by edges. We can also represent such a graph by a so-called symmetric adjacency (n × n) matrix, where n is the number of nodes. This matrix has elements

$$A_{ij} = \begin{cases} 1 & \text{if there is an edge between vertices } i \text{ and } j \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

Eigenvector centrality is a more sophisticated version of the degree metric, which is a simple way to measure the influence or importance of a node [14]. The degree $k_i$ of a node $i$ is

$$k_i = \sum_{j=1}^{n} A_{ij}, \qquad (6)$$

where A is the adjacency matrix which represents the ties between nodes i and j.

Whereas degree centrality gives a simple count of the number of ties a node has, eigenvector centrality acknowledges that not all connections are equal. Because some edges represent stronger connections than others, the edges can be weighted. To sum up, connections to nodes which are themselves influential lend a node more influence than connections to less influential nodes. Denoting the centrality of a node i by $x_i$, it is possible to make $x_i$ proportional to the average of the centralities of i's network neighbours:

$$x_i = \frac{1}{\lambda} \sum_{j=1}^{n} A_{ij} x_j, \qquad (7)$$

where λ is a constant. This equation can also be rewritten by defining the vector of centralities $x = (x_1, x_2, \ldots)$:

$$\lambda x = A x, \qquad (8)$$

where x is an eigenvector of the adjacency matrix with eigenvalue λ. All these calculations are computed on the user's device. The JAMA [1] library has helped us to easily obtain the most important nodes of the graph (we have also used an online tool⁹ to quickly check the values obtained with JAMA). Our method uses this library to obtain the eigenvectors of the given adjacency matrix: first, the corresponding eigenvalues are estimated; then we extract the eigenvector associated with the highest eigenvalue, which corresponds to the most important node of the graph. Applying this eigenvector calculation to the A' matrix from Section 3.1, we obtain the highest eigenvalue λ₁ and its corresponding eigenvector e₁ (9), where the first component corresponds to the current user (it is also the highest value), so we have to ignore it. The next highest value is the one we take as the most important node of the graph.
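The original implementation performs this step with the Java JAMA library; the following numpy sketch shows the equivalent computation: eigen-decomposition of the symmetric matrix, selection of the eigenvector of the largest eigenvalue, and skipping the first component, which corresponds to the current user.

```python
import numpy as np

def most_important_node(A):
    """Index of the most important node by eigenvector centrality,
    ignoring the current user at index 0. numpy stands in here for
    the JAMA library used in the Java implementation."""
    eigenvalues, eigenvectors = np.linalg.eigh(A)     # A is symmetric
    principal = eigenvectors[:, np.argmax(eigenvalues)]
    centrality = np.abs(principal)   # resolve the sign ambiguity of eigenvectors
    centrality[0] = -np.inf          # the first component is the current user
    return int(np.argmax(centrality))
```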
We have encapsulated the Foursquare user object into a new CompactUser object which also has a set of recommendations assigned to it, each one composed of a series of items. Taking as an example the generic categories of Amazon.com, we have tested our solution with a few controlled users who are friends on Foursquare and some randomly generated users, in order to have a controlled scenario. Results are detailed in Section 4. Once the recommendations of the most important users are obtained, we upload them to a web server on Google App Engine¹⁰ using a simple Python service. For each user we store all possible recommendations (we manage nine main categories) and we update the estimate and the probability of each one matching the user's tastes. To evaluate this, the developed application first asks about the user's tastes among the cited categories. This information is stored in a SQLite database (just for the purpose of evaluating the solution). The service returns a JSON object with the recommendations and their likelihood probabilities for the user. This JSON object is parsed on the device side in order to generate the corresponding recommendations for the user.

4 RESULTS

Our solution has been evaluated by presenting to our test users the default categories that Amazon.com uses, together with another list containing our category recommendations. Once our users had compared both lists, they filled in a questionnaire to capture their satisfaction with the obtained results. Amazon.com's default recommendations are the following:
Kindle-related products
Clothing trends
Products being seen by other customers
Best watch prices
Best laptop prices
Top-selling books
These recommendations (not categories) are not based on any user preference. On the contrary, our list of recommendations establishes a new order within Amazon.com's categories, indicating the probability of each one matching the tastes of the user. Navigating to the Amazon.com website with privacy mode enabled allows us to see the default recommendations, without taking into account any previous purchase or search. Our objective is focused on recommending items from the Amazon.com categories (listed in Table 3). Testing our solution with real users, we obtained a certain approximation to their tastes. Despite the few users, check-ins and data we have access to, the system is capable of generating first generic Amazon.com recommendations (as already detailed in this section). Table 3 shows the probabilities obtained for a user with the following tastes (see the whole list of Amazon.com categories in Table 2): automotive & industrial; movies, music, games; electronics & computers; sports & outdoors. The middle column shows the probability generated by the system for each category with an input of just 3 user check-ins. The right column corresponds to the same user after doing 5 check-ins; these values are more refined, being more in accordance with the user's known tastes. Finally, users have to fill in a questionnaire rating the presented categories. This rating includes mandatory and controlled answers, from 1 to 4. This way we can compare the results with the obtained probabilities. Table 2 shows the probabilities calculated for a user from the resulting questionnaire answers. Table 4 then compares both probabilities and calculates the approximation of each estimation. It is important to emphasize that a new matrix is built with every user check-in, which means that previous matrices are overwritten. However, the probabilities are dynamic, and they are refined every time the user checks in at a venue. The closer ε is to 0.0, the more accurate our solution becomes. Some values of ε show that more check-ins are needed to refine the obtained probabilities. On the one hand, for the 3 check-in results the worst values are those for Automotive & industrial, Grocery, health & beauty, Toys, kids & baby and Sports & outdoors. This means that if the user is interested in Sports & outdoors, the system could fail to recommend any item from this category, or even worse, recommend items from Grocery, health & beauty. On the other hand, the 5 check-in error column is more accurate and its values are closer to 0.0. This test shows how more check-ins lead to more refined recommendations.
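The approximation and deviation error used in this comparison (defined below Table 4 as App = solution probability / questionnaire probability and ε = 1 - App) reduce to a few lines; a sketch with illustrative names:

```python
def deviation_errors(solution_probs, questionnaire_probs):
    """Per-category approximation (App.) and deviation error (eps),
    following the definitions given below Table 4."""
    result = {}
    for category, q in questionnaire_probs.items():
        s = solution_probs.get(category, 0.0)
        app = s / q if q > 0 else 0.0   # guard against a zero questionnaire probability
        result[category] = {"approximation": app, "error": 1.0 - app}
    return result
```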

Table 2: Results obtained from a user's questionnaire answers. The left column lists the recommendation categories; the right column (values not reproduced here) shows the taste probabilities obtained from the questionnaire. The categories are: (1) Home, garden & tools; (2) Clothing, shoes & jewelry; (3) Books; (4) Electronics & computers; (5) Automotive & industrial; (6) Movies, music, games; (7) Grocery, health & beauty; (8) Toys, kids & baby; (9) Sports & outdoors.

Table 3: Probabilities obtained from our system for one user performing 3 and 5 check-ins, for the categories (1)-(9) above (values not reproduced here).

Table 4: Comparison of the probabilities obtained from the questionnaire and the probabilities calculated with the proposed solution using the 3 and 5 check-in results, for the categories (1)-(9). Approximation (App.) = solution probability / questionnaire probability; deviation error (ε) = 1 - approximation.

5 CONCLUSIONS AND FUTURE WORK

This paper explores the possibility of using relevant data from users' social networks to alleviate the cold-user problem in a recommender system domain. The proposed solution extracts the most valuable node of the graph generated by checking in at a venue with an Android application using the Foursquare API. By obtaining the recommendations of this node, we estimate the probability of some categories matching the user's tastes. In the near future we will take into account data from sources other than Foursquare. Other social networks with accessible APIs will be useful to determine more accurately the preferred items for a user. Moreover, combining different social network analysis metrics may yield more accurate results. Another important aspect is to take into account more than just the most valuable node when making recommendations. It will also be interesting to store the obtained matrices for each venue and update them with every check-in. At present, matrices are not stored, so they are not dynamic, which means that a new matrix is built every time the user checks in at a venue, overwriting any previous matrix. Finally, it becomes necessary to test the solution with a higher number of users, increasing the variety of their tastes and the offered items. We are limited by our environment because of the small number of users and check-ins available; therefore, dissemination of this work would be very useful to obtain more real data.

REFERENCES

[1] JAMA: A Java matrix package.
[2] MovieLens movie recommendation service.
[3] Using Foursquare data to redefine a neighborhood.
[4] G. Adomavicius and A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering, 17(6), (2005).
[5] S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly, Video suggestion and discovery for YouTube: taking random walks through the view graph, in Proceedings of the 17th International Conference on World Wide Web, ACM, (2008).
[6] P.L. Chen, C.T. Tsai, Y.N. Chen, K.C. Chou, C.L. Li, C.H. Tsai, K.W. Wu, Y.C. Chou, C.Y. Li, W.S. Lin, et al., A linear ensemble of individual and blended models for music rating prediction, in KDD Cup 2011 Workshop, (2011).
[7] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, Combining content-based and collaborative filters in an online newspaper, in Proceedings of the ACM SIGIR Workshop on Recommender Systems, (1999).
[8] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, et al., The YouTube video recommendation system, in Proceedings of the Fourth ACM Conference on Recommender Systems, ACM, (2010).
[9] H. Kautz, B. Selman, and M. Shah, Referral Web: combining social networks and collaborative filtering, Communications of the ACM, 40(3), 63-65, (1997).
[10] I. Konstas, V. Stathopoulos, and J.M. Jose, On social networks and collaborative recommendation, in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, (2009).
[11] A.N. Langville, C.D. Meyer, and P. Fernández, Google's PageRank and beyond: The science of search engine rankings, The Mathematical Intelligencer, 30(1), 68-69, (2008).
[12] G. Linden, B. Smith, and J. York, Amazon.com recommendations: Item-to-item collaborative filtering, IEEE Internet Computing, 7(1), 76-80, (2003).

[13] P. Melville, R.J. Mooney, and R. Nagarajan, Content-boosted collaborative filtering for improved recommendations, in Proceedings of the National Conference on Artificial Intelligence, AAAI Press / MIT Press, (2002).
[14] M.E.J. Newman, The mathematics of networks, The New Palgrave Encyclopedia of Economics, 2, (2008).
[15] A.T. Nguyen, N. Denos, and C. Berrut, Improving new user recommendations with rule-based induction on cold user data, in Proceedings of the 2007 ACM Conference on Recommender Systems, (2007).
[16] P. Resnick and H. R. Varian, Recommender systems, Communications of the ACM, 40(3), 56-58, (1997).
[17] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, Item-based collaborative filtering recommendation algorithms, in Proceedings of the 10th International Conference on World Wide Web, (2001).
[18] A.I. Schein, A. Popescul, L.H. Ungar, and D.M. Pennock, Methods and metrics for cold-start recommendations, in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (2002).

Preprocessing Algorithm for Handling Non-Monotone Attributes in the UTA Method Alan Eckhardt¹ and Tomáš Kliegr² Abstract. UTA is a method for learning user preferences originally developed for multi-criteria decision making. UTA expects the input attributes to be monotone with respect to preferences, which limits the applicability of the method and requires manual input for each attribute. In this paper, we propose a heuristic attribute preprocessing algorithm that transforms arbitrary input attributes into a space approximately monotone with respect to user preferences, thus making them suitable for UTA. In an experimental evaluation on several real-world datasets, preprocessing the input data with the proposed algorithm consistently improved the results in comparison with the UTA-ADJ variation of the UTA STAR algorithm. 1 Introduction The UTA (UTilités Additives) method (Jacquet-Lagrèze [9]) learns an additive piece-wise linear utility model. Although a relatively old approach, it is used as a basis for many recent utility-based preference learning algorithms, e.g. [8]. UTA takes a set of alternatives ordered according to the user's preferences and learns utility functions for each attribute. Using these functions, the utilities of individual attribute values are combined into the overall utility for a given object. UTA expects the input attributes to be monotone with respect to preferences, which not only requires manual input for each attribute but also limits the applicability of the method. In this paper, we propose a heuristic attribute preprocessing algorithm that transforms arbitrary input attributes into a space approximately monotone with respect to user preferences, thus making them suitable for UTA. This paper is divided into four sections. Section 2 describes the UTA method with a focus on its monotonicity constraints. The proposed algorithm for learning local preferences is introduced in Section 3. Experimental evaluation of the algorithm is presented in Section 4, and the concluding remarks are in Section 5. 2 Related Work The UTA method [9] draws inspiration from the way humans handle decision-making tasks at the level of abstraction provided by microeconomic utility theory. The principal inductive bias used in the UTA method is the monotonicity of the utility functions. The UTA method also incorporates additional assumptions, particularly piece-wise linearity and additivity of the utility functions. The UTA method aims at inferring one or more additive value functions from a given ranking (weak ordering) on a reference set X of objects³. Each object is described by N criteria $G = \{g_1, \ldots, g_N\}$. The evaluation scale of each criterion is given by the real-valued function $g_i : X \to [g_{i*}, g_i^*]$. The value $g_{i*}$ is considered the worst level of the worth of the object from the decision maker's point of view with respect to criterion $g_i$, and $g_i^*$ the best level; $g_i(x)$ denotes the evaluation of object $x \in X$ on criterion $g_i$. The method uses linear programming to find N partial value functions $u_i$ that best explain the given preferences. ¹ Department of Software Engineering, Charles University in Prague, Czech Republic, eckhardt@ksi.mff.cuni.cz ² Department of Information and Knowledge Engineering, University of Economics in Prague, Czech Republic, and Multimedia and Vision Group, Queen Mary University, London, UK. tomas.kliegr@vse.cz
The overall preference rating for an object x is computed as a sum of utility values across all criteria:

$$U(x) = \sum_{i=1}^{N} u_i(x),$$

where x is an object and the $u_i$ are nondecreasing marginal value functions, which we call partial utility functions. It should be noted that the partial utility functions are piece-wise linear. For each criterion, the interval $[g_{i*}, g_i^*]$ is cut into $\alpha_i - 1$ equal intervals $[g_i^1, g_i^2], \ldots, [g_i^{\alpha_i - 1}, g_i^{\alpha_i}]$. This discretization is not significant for nominal criteria; for these, the number of breakpoints $\alpha_i$ (including the end points $g_{i*}, g_i^*$) corresponds to the number of values of the criterion. However, for cardinal criteria, the marginal value of an object with $g_i(x) \in [g_i^j, g_i^{j+1}]$ is approximated using linear interpolation between the two nearest breakpoints $g_i^j, g_i^{j+1}$. The condition of monotonicity requires that if, for $x, x' \in X$, object x is weakly preferred to object x', then for another object $x'' \in X$ such that $g_h(x'') \geq g_h(x)$ for all $h = 1, \ldots, N$, object x'' should also be weakly preferred to object x' [2]. In the area of Multi-Criteria Decision Making (MCDM), it is the role of the expert to derive the set of criteria from the properties of the object. The expert not only checks that the criteria meet the monotonicity requirement, but also orients the domain so that the value $g_{i*}$ is considered the worst and the value $g_i^*$ the best for criterion i. If UTA is to be applied in a wider machine learning context, rather than in the narrow area of decision support systems, a manual approach to transforming attributes into criteria is not feasible. Kliegr [10] proposed a non-monotonic extension of the UTA Star algorithm, called UTA-NM, which allows $u_i$ to change direction from ascending to descending, thus allowing the use of criteria⁴ that do not meet the monotonicity requirement. Every change of direction within one partial utility function is penalized to ensure that the resulting model is not overly complex and overfitted to the data. In this paper, we investigate another option for dealing with non-monotonicity, which is based on an automatic transformation of the original, potentially non-monotonic attributes of the input objects into criteria. ³ Usually referred to as actions or alternatives in the UTA literature. ⁴ Since the notion of criterion is closely tied with the monotonicity requirement, it would be more precise to speak of attributes.
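To make the additive piece-wise linear model above concrete, the following sketch evaluates U(x) given breakpoint grids and marginal values that are assumed to have already been produced by the UTA linear program; all names are illustrative.

```python
import numpy as np

def partial_utility(g_x, breakpoints, values):
    """Piece-wise linear marginal value u_i: linear interpolation between
    the utility values fixed at the breakpoints g_i^1, ..., g_i^alpha."""
    return float(np.interp(g_x, breakpoints, values))

def total_utility(x, criteria):
    """U(x) = sum_i u_i(g_i(x)). 'criteria' is a list of triples
    (g_i, breakpoints_i, values_i), where g_i maps an object to its
    evaluation on criterion i and values_i are nondecreasing."""
    return sum(partial_utility(g(x), bps, vals) for g, bps, vals in criteria)
```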

The algorithm is heuristic, i.e. the resulting criteria are not guaranteed to meet the condition of monotonicity. The performance of this preprocessing step is evaluated against a baseline obtained with a non-monotonic UTA implementation derived from UTA-NM. 3 Heuristic Algorithm for Transformation of Attributes to Criteria Modern preference learning research deals directly with the properties of input objects: an object x is typically described by a vector of attribute values $(x_1, \ldots, x_N)$, and no special requirements on the attribute values are typically made [7]. In contrast, the original UTA model, and MCDA in general, abstracts from the original attributes of the object. It is assumed that the expert has already used these attributes to construct the criteria set G; each object is represented by criteria values $(g_1(x), \ldots, g_N(x))$. The individual criteria are assumed to meet several properties, including the condition of monotonicity introduced in the previous section.⁵ In this section, we present a heuristic algorithm that transforms the original values of the attributes so that the result better meets the key requirement on criteria addressed in this paper: the condition of monotonicity. Since the input for the algorithm is the preference rating of one user⁶, we call it the local preference transformation. Using $D_{A_i}$ to denote the domain of the original attribute i, the local preference transformation can be viewed as a function $f_i : D_{A_i} \to [g_{i*}, g_i^*]$, which is a concretization of the criterion function $g_i : X \to [g_{i*}, g_i^*]$ in the formal UTA model. Each object x originally described by attribute values $x_1, \ldots, x_N$ is now described by values $f_1(x_1), \ldots, f_N(x_N)$. The intuition is that $f_i(x_i)$ is the average of the particular user's ratings $r(x)$ across objects x that have value $x_i$ in attribute i. Since the preference information consumed by the UTA method comes in the form of a weak order of alternatives rather than ratings, for the purpose of the local transformation algorithm this weak order needs to be transformed into ratings. A straightforward approach is to assign values in the [0, 1] range corresponding to the position of the object in the weak preference order, with the best-ranked object receiving the highest rating 1 and the worst-ranked object the lowest rating 0, and the remaining ranks equidistantly distributed in the [0, 1] range. The definition of the transformation function $f_i$ depends on the type of the input attribute, which can be either nominal or cardinal. In our present work, ordinal attributes are not addressed; however, they can be cast to integers and handled as a cardinal data type. In the remainder of this section, we discuss the definition of the local transformation function $f_i$ first for nominal and then for cardinal attributes. Figure 1 illustrates the computation on the notebook choice problem. Ratings of three hypothetical users with respect to the nominal notebook colour attribute are presented. For each user and colour, a frequency distribution of ratings is shown. Let us focus on one value only, e.g. black. Black-coloured notebooks are highly preferred by users 1 and 2. The vertical line in the histogram represents the average rating of the objects with colour black: for both users, the average is close to 1. This average will be used as the representative rating of the value black for the given user, or in other words, the preference for the black colour.
Formally, the set of black objects is denoted as $\{x \mid x_{colour} = black\}$ and the sum of the ratings of black objects is then $\sum_{\{x \mid x_{colour} = black\}} r(x)$. The resulting representative ratings do not necessarily maintain the monotonicity of ratings. Consider five objects (given in the order of increasing stated preference by the user) $x_1, x_2, x_3, x_4, x_5$ with two attributes, colour and cpu. The description of these objects with respect to these attributes is given in Table 1. Once the stated preference, originally given in terms of stars, is converted to a numerical rating, the original attributes can be transformed with the local preference function, obtaining $f_{colour}(black) = 0.375$ and $f_{colour}(red) = 0.58$, $f_{cpu}(intel) = 0.25$, $f_{cpu}(amd) = 0.5$ and $f_{cpu}(motorola) = 1.0$. This example can also be used to illustrate the fact that the result of the local preference transformation is not guaranteed to satisfy the condition of monotonicity: $x_4$ is preferred to $x_3$, and for $x_2$ it holds that $g_h(x_2) \geq g_h(x_4)$ for all $h = 1, \ldots, N$, so $x_2$ should also be weakly preferred to $x_3$. Since this is not satisfied, the condition of monotonicity is violated.

Table 1. Notebook choice example: nominal attributes.
object:        x1      x2     x3      x4      x5
stated order:  *       **     ***     ****    *****
rating:        0.0     0.25   0.5     0.75    1.0
colour:        black   red    red     black   red
cpu:           intel   amd    intel   amd     motorola
f(colour):     0.375   0.58   0.58    0.375   0.58
f(cpu):        0.25    0.5    0.25    0.5     1.0

3.1 Nominal Attributes

Representants is a method for finding preferences over nominal attributes, which we proposed in [4]. It is based on the analysis of the distribution of ratings. The local preference function for a nominal attribute $A_i$ has the following form:

$$f_i(a) = \frac{\sum_{\{x \mid x_i = a\}} r(x)}{|\{x \mid x_i = a\}|},$$

where a is an attribute value. ⁵ In this paper, we make the assumption that the original N input attributes are converted to the same number of criteria. ⁶ Decision maker in the MCDA terminology.

3.2 Cardinal Attributes

This section describes the proposed approach for the transformation of cardinal (numerical) attributes. Linear regression is a very useful method that finds a relation in a data set in the form of a linear function. In the local preference transformation, univariate linear regression is used to find a relationship between each cardinal input attribute and the rating as the dependent variable. Expanding our notebook example, consider an additional notebook price attribute. Applying the local transformation for cardinal attributes, the notebook price is used as the regressor and the user rating as the dependent variable.
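A compact sketch of the local preference transformation follows: the weak order is first turned into equidistant ratings, nominal attributes are handled with the Representants average of Section 3.1, and cardinal attributes with the univariate regression of Section 3.2. numpy's polyfit stands in for whatever regression routine is actually used; the function names are ours.

```python
import numpy as np
from collections import defaultdict

def order_to_ratings(order):
    """Map a weak order (object ids, best first) to equidistant ratings
    in [0, 1]: best object 1, worst object 0 (assumes >= 2 objects)."""
    n = len(order)
    return {obj: 1.0 - i / (n - 1) for i, obj in enumerate(order)}

def nominal_transform(values, ratings):
    """Representants (Section 3.1): f_i(a) = average rating of the
    objects whose attribute i has value a."""
    sums, counts = defaultdict(float), defaultdict(int)
    for a, r in zip(values, ratings):
        sums[a] += r
        counts[a] += 1
    return {a: sums[a] / counts[a] for a in sums}

def cardinal_transform(values, ratings):
    """Univariate linear regression (Section 3.2): returns the fitted
    f_i(v) = alpha * v + beta, with the rating as dependent variable."""
    alpha, beta = np.polyfit(values, ratings, deg=1)
    return lambda v: alpha * v + beta
```

On the data of Table 1, nominal_transform(["black", "red", "red", "black", "red"], [0, 0.25, 0.5, 0.75, 1]) reproduces f_colour(black) = 0.375 and f_colour(red) ≈ 0.58.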

Figure 1: Illustrative ratings of three users for the notebook colour attribute (per-colour histograms of the number of objects against the rating, for uniform and expressive rating behaviours).

The result of the regression is a linear function of the form $f_{price}(price) = \alpha \cdot price + \beta$. If the underlying attribute is of the so-called cost type, i.e. user preference decreases with increasing attribute value, a direct use of the attribute value $x_i$ in place of the criterion value $g_i(x)$ would violate the condition that the value $g_{i*}$ is considered the worst level of the worth of the object from the decision maker's point of view with respect to criterion $g_i$ and $g_i^*$ the best level (Figure 2b). Linear regression will generally solve this problem, rearranging the values appropriately.

Figure 2: Four basic types of preference over cardinal domains (panels a-d; the plots show the partial utility against attribute values such as megapixels and price).

However, the weakness of linear regression is that it finds only linear orderings, and therefore it does not generally fix a violation of the condition of monotonicity. Figure 2 shows four typical types of preferences over cardinal attributes. Only two of them are linear. Linear regression is not able to find the hill and valley types. These types can be learned using our method Peak, proposed in [5]. We have also considered other possibilities for finding preferences over cardinal attributes; in our preliminary experiments on small training sets, using quadratic regression consistently worsened the results.

4 Experiments

4.1 Performance measures

We will use the following notation: r(o) is the user's rating of object o, and $\hat{r}(o)$ is the rating predicted by the method. The tau coefficient expresses the similarity of two ordered lists $L_1, L_2$. In our case, the first list $L_1$ is ordered according to the user's rating and the second list $L_2$ according to the rating estimated by the method. The lists are sorted in decreasing order, so that the most preferred objects are at the top. In the simplest case, the lists consist only of the ids of objects. The tau coefficient is then computed according to the number of concordant pairs. A pair of objects o, p is concordant if either o is before p in both lists, or p is before o in both lists. A pair that is not concordant is discordant. The tau coefficient can then be computed as follows:

$$\tau(L_1, L_2) = \frac{n_c - n_d}{\frac{1}{2}\, n\, (n-1)},$$

where $n_c$ is the number of concordant pairs and $n_d$ is the number of discordant pairs. In the following formula, δ stands for the Kronecker delta: δ(condition) = 1 if the condition is true, 0 otherwise.

$$n_c = \sum_{o,p \in X} \delta\big(\operatorname{sgn}(L_1(o) - L_1(p)) = \operatorname{sgn}(L_2(o) - L_2(p))\big)$$

Another measure expressing the similarity of two lists is the Pearson correlation. This measure is often used in machine learning for studying the dependency of two variables. If the correlation is high, the two lists are similar; if the correlation is low, the lists are different. The value of the correlation coefficient ranges from -1 to 1. Value -1 corresponds to lists with exactly inverse orderings, 1 to the same ordering, and correlation 0 means there is no connection between the two orderings.

$$corr(r, \hat{r}) = \frac{\sum_{o \in X} (r(o) - \bar{r})(\hat{r}(o) - \bar{\hat{r}})}{(n-1)\, s_r\, s_{\hat{r}}},$$

where $\bar{r}$ is the average rating, $\bar{\hat{r}}$ is the average predicted rating, $s_r$ is the standard deviation of the original ratings and $s_{\hat{r}}$ is the standard deviation of the predicted ratings.
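Both measures can be computed directly from the definitions above; a sketch, with ratings passed as dictionaries or arrays and illustrative names:

```python
import numpy as np
from itertools import combinations

def tau_coefficient(true_ratings, pred_ratings):
    """Tau coefficient from concordant/discordant pairs; inputs map
    object ids to the true and predicted ratings. A pair counts as
    concordant when the signs of the rating differences agree."""
    objs = list(true_ratings)
    n = len(objs)
    n_c = sum(
        np.sign(true_ratings[o] - true_ratings[p])
        == np.sign(pred_ratings[o] - pred_ratings[p])
        for o, p in combinations(objs, 2)
    )
    n_d = n * (n - 1) // 2 - n_c      # every non-concordant pair is discordant
    return (n_c - n_d) / (0.5 * n * (n - 1))

def pearson_correlation(r, r_hat):
    """Pearson correlation with sample standard deviations, matching
    the (n - 1) normalization in the formula above."""
    r, r_hat = np.asarray(r, float), np.asarray(r_hat, float)
    n = len(r)
    num = np.sum((r - r.mean()) * (r_hat - r_hat.mean()))
    return float(num / ((n - 1) * r.std(ddof=1) * r_hat.std(ddof=1)))
```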
The UTA method is focused on preserving the order of objects rather than estimating their ratings, so it does not make sense to use the Root Mean Square Error (RMSE) metric, which is commonly used for other methods. Two measures of computational performance will also be studied. The first is the time required to train the method on a set of ranked objects. The second is the time required to evaluate a testing object.

Table 2. Datasets with a monotone class attribute from the UCI repository: Car Evaluation, Contraceptive Method Choice, Nursery, Poker Hand, Post-Operative Patient, Teaching Assistant Evaluation, Wine, Wine Quality [1].

4.2 Datasets

The UCI repository [6] contains a number of datasets with different properties. We were mostly concerned with classification tasks where there is an ordering of classes. For the evaluation of the UTA method, only datasets suitable for the object ranking task were relevant; e.g. the famous Iris dataset was excluded, because we are not able to order the three classes (Iris Setosa, Iris Versicolour and Iris Virginica) in a meaningful way. Every classification dataset of UCI was studied for the presence of monotonicity in the class attribute. The chosen datasets from the UCI repository are listed in Table 2. We acknowledge that using UCI for the validation of user preference learning methods may not give representative results, since these are not real-world preference datasets.

4.3 Experimental Implementation

For the experimental evaluation, we used our implementation of non-monotonic UTA called UTA-ADJ. It is based on ideas similar to the UTA-NM algorithm, introduced in our earlier work [10]. UTA-NM removes the monotonicity constraints imposed by the UTA Star algorithm. A penalization element is added to prevent overfitting by an excessive number of changes in the shape of partial utility functions; however, the penalization entails an excessive computational cost. To remedy the computational issue, UTA-ADJ takes a different route to allow non-monotonicity in partial utility functions. UTA-ADJ runs the standard UTA Star algorithm multiple times, gradually testing the influence of placing a change of shape at individual breakpoints across all partial utility functions. The breakpoint at which the change of shape yields the largest increase in the objective function, in comparison with the baseline, is retained. This procedure is repeated until the preset threshold on the maximum number of changes in shape is reached or the improvement in the objective function falls below a preset minimum increase. In the first iteration, the baseline is the objective value attained by a fully monotonic run of the UTA Star algorithm; in subsequent iterations it is the best objective value from the previous iteration. The computational cost of multiple runs of the UTA Star algorithm is, in our experience, much lower than the cost of a single UTA-NM run with the penalization constraints. Nevertheless, it should be noted that the computational advantage of UTA-ADJ over UTA-NM is inversely related to the maximum-number-of-changes-in-shape threshold. The UTA-ADJ algorithm closely resembles the method proposed by Despotis and Zopounidis [3], with the difference that the breakpoints where the change of shape occurs are not externally set parameters, but are determined automatically. Also, more changes of shape per partial utility function are allowed. The UTA-ADJ implementation is available as an interactive online application at vse.cz/uta.
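The greedy loop just described can be summarized as follows. This is a sketch under stated assumptions, not the actual implementation: solve_uta_star is a caller-supplied (hypothetical) wrapper around the UTA Star linear program that returns the attained objective value for a given set of shape changes.

```python
def uta_adj(data, breakpoints, solve_uta_star, max_changes=2, min_gain=1e-4):
    """Greedy UTA-ADJ search: repeatedly re-run UTA Star, trying a change
    of shape at each breakpoint of each partial utility function, and keep
    the single change with the largest objective gain."""
    shape_changes = set()                            # (criterion, breakpoint) pairs
    best_obj = solve_uta_star(data, shape_changes)   # fully monotonic baseline
    for _ in range(max_changes):
        candidates = [
            (solve_uta_star(data, shape_changes | {(i, j)}), (i, j))
            for i, bps in enumerate(breakpoints)
            for j in range(1, len(bps) - 1)          # interior breakpoints only
            if (i, j) not in shape_changes
        ]
        if not candidates:
            break
        obj, change = max(candidates)
        if obj - best_obj < min_gain:                # gain below the preset minimum
            break
        best_obj = obj
        shape_changes.add(change)
    return shape_changes, best_obj
```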
4.4 Experimental Results

This section describes our experiments on the UCI datasets. The results are averaged across all the datasets listed in Table 2. For a given training set size, 20 different (randomized) training sets were used for each dataset. We compare UTA-ADJ with the (possibly) improved UTA+Local, which is also a UTA-ADJ run, but involving the local preference transformation. The maximum-number-of-changes-in-shape threshold was set to 2 for all experiments.

Figure 3: Tau rank coefficient for all datasets (UTA vs. UTA+Local, as a function of train set size).

Figure 4: Correlation for all datasets (UTA vs. UTA+Local, as a function of train set size).

Figure 3 presents a comparison of performance on individual datasets in terms of the tau coefficient. Here, the advantage of using local preferences is clear: the improvement in the value of the tau coefficient for moderate train sizes is around 0.05. A comparable result for the Pearson correlation can be observed in Figure 4. The relative increase in correlation and tau coefficient is about 28%. The results are convincing: using local preferences with the UTA-ADJ variant of the UTA method significantly improved its performance in all performance measures compared to the UTA-ADJ-only baseline. Moreover, the time to train the classifier also decreased. All results were statistically significant.

5 Conclusion

We proposed a preprocessing algorithm called the local preference transformation, which allows the UTA method to be used with non-monotone attributes.

The experimental results confirm the benefits of the proposed approach as a preprocessing step for UTA-ADJ, a variant of the UTA method which already has some adjustments for handling non-monotone attributes. We expect the effectiveness of our local preference preprocessing algorithm to hold also for the common UTA Star algorithm; nevertheless, an experimental evaluation of this hypothesis is a priority for future work. A promising direction for extending our research is using more elaborate methods than linear regression for preprocessing cardinal attributes. Regarding nominal attributes, further work should be directed at the investigation of the circumstances under which the proposed algorithm is and is not effective in maintaining the condition of monotonicity. It would also be interesting to find out the relation between the degree of monotonicity of the data and the performance of UTA with and without applying the proposed local preference transformation in the preprocessing phase. The UTA method implementation used in the experiments is available in the form of an interactive web application at vse.cz/uta. We plan to make the local transformation algorithm available to the community by integrating it with this software.

Acknowledgment

We would like to thank the reviewers for their comments, which helped to improve the quality of this paper considerably. This research was supported by projects SVV and MŠM and by grant GAČR P202/10/0761.

REFERENCES

[1] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis, Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems, (June 2009).
[2] Krzysztof Dembczyński, Wojciech Kotłowski, Roman Słowiński, and Marcin Szeląg, Learning of rule ensembles for multiple attribute ranking problems, in Preference Learning, eds. Johannes Fürnkranz and Eyke Hüllermeier, Springer-Verlag, (2010).
[3] D.K. Despotis and Constantin Zopounidis, Building additive utilities in the presence of non-monotonic preferences, in Advances in Multicriteria Analysis, eds. P.M. Pardalos, Y. Siskos, and C. Zopounidis, Kluwer Academic Publishers, Dordrecht, (1993).
[4] Alan Eckhardt and Peter Vojtáš, Considering data-mining techniques in user preference learning, in 2008 International Workshop on Web Information Retrieval Support Systems, (2008).
[5] Alan Eckhardt and Peter Vojtáš, Learning user preferences for 2CP-regression for a recommender system, in SOFSEM '10: Proceedings of the 36th Conference on Current Trends in Theory and Practice of Computer Science, Springer-Verlag, Berlin, Heidelberg, (2010).
[6] A. Frank and A. Asuncion, UCI machine learning repository.
[7] Johannes Fürnkranz and Eyke Hüllermeier, Preference Learning, Springer, (2010).
[8] Salvatore Greco, Vincent Mousseau, and Roman Słowiński, Ordinal regression revisited: multiple criteria ranking using a set of additive value functions, European Journal of Operational Research, 191(2), (December 2008).
[9] E. Jacquet-Lagrèze and Yannis Siskos, Assessing a set of additive utility functions for multicriteria decision-making: the UTA method, European Journal of Operational Research, 10(2), (June 1982).
[10] Tomáš Kliegr, UTA-NM: Explaining stated preferences with additive non-monotonic utility functions, in Proceedings of the Workshop on Preference Learning at ECML/PKDD 2009, (September 2009).

Learning from Pairwise Preference Data using Gaussian Mixture Model Mihajlo Grbovic and Nemanja Djuric and Slobodan Vucetic¹ Abstract. In this paper we propose a fast online preference learning algorithm capable of utilizing incomplete preference information. It is based on a Gaussian mixture model that learns soft pairwise label preferences via minimization of the proposed soft rank loss measure. Standard supervised learning techniques, such as gradient descent or Expectation-Maximization, can be used to find the unknown model parameters. The algorithm outputs soft pairwise label preference predictions that need to be further aggregated to produce a total label ranking prediction, for which several existing algorithms can be used. The main advantages of the proposed learning algorithm are the ability to process a single training instance at a time, low time and space complexity, ease of implementation, and model reuse. 1 INTRODUCTION Label ranking is emerging as an important and practically relevant preference learning field. Unlike the standard problems of classification and regression, label ranking is a complex learning task which involves the prediction of strict label order relations, rather than single values. Specifically, in the label ranking scenario, each instance, described by a set of features x, is assigned a ranking of labels π, that is, a total (e.g. π = (3, 5, 1, 4, 2)) or partial (e.g. π = (5, 3, 2)) order over a finite set of class labels Y (e.g. Y = {1, 2, 3, 4, 5}). The label ranking problem consists of learning a model that maps instances x to a total label order, $h : x_n \to \pi_n$. It is assumed that a sample from the underlying distribution, $D = \{(x_n, \pi_n),\ n = 1, \ldots, N\}$, where $x_n$ is a d-dimensional feature vector and $\pi_n$ is a vector containing a total or partial order of a finite set Y of L class labels, is available for training. This problem has recently received a lot of attention in the machine learning community and has been extensively studied [6, 4, 9, 3, 16]. A survey of recent label ranking algorithms can be found in [8]. There are many practical applications in which the objective is to learn an exact label preference for an instance in the form of a total order. For example, in the case of document categorization, where it is very likely that a document belongs to multiple topics (e.g. sports, entertainment, baseball, etc.), one might not be interested only in predicting which topics are relevant for a specific document, but also in ranking the topics by relevance. Additional applications include: meta-learning [18], where, given a new data set, the task is to induce a total rank of available algorithms according to their suitability based on the data set properties; predicting food preferences for new customers based on survey results, demographics, and other characteristics of respondents [12]; and determining the order of questions in a survey for a specific user based on the respondent's attributes. A recent publication [14] suggests clustering of label ranking data, which could be of great practical importance, especially in target marketing. There are three principal approaches to label ranking. The first decomposes the label ranking problem into one or several binary classification problems. ¹ Temple University, Department of Computer and Information Sciences, Philadelphia, USA, {mihajlo.grbovic, nemanja.djuric, slobodan.vucetic}@temple.edu
Ranking by pairwise comparison (PW) [11], for example, creates L(L-1)/2 classification problems, one for each possible pairwise ranking. The pairwise binary classifier predictions are aggregated into a total label order by voting. The constraint classification (CC) approach [9], on the other hand, transforms the label ranking problem into a single binary classification problem by augmenting the data set such that each example x is mapped into L(L-1)/2 examples of dimension d·L. This allows for the training of a single classifier. The second approach is to use utility functions, where the goal is to learn mappings $f_k : X \to \mathbb{R}$ for each label $k = 1, \ldots, L$, which assign a value $f_k(x)$ to each label, such that $f_i(x) < f_j(x)$ if x prefers label j over label i. For example, the label ranking method proposed in [6] represents each $f_k$ as a linear combination of base ranking functions. The utility functions are learned to minimize the number of ranking errors and the final rank is produced simply by sorting the utility scores. It should be noted that the utility function-based approach is also popular in the related object ranking problem, where techniques based on SVM [10] and AdaBoost [7] have been proposed. The third approach is represented by a collection of algorithms which use probabilistic approaches for label ranking, such as the ones that rely on the Mallows [13] and the Plackett-Luce (PL) [15] models. A typical representative is instance-based (IB) label ranking [4, 3]. Given a new instance x, the k-nearest-neighbor algorithm is used to locate its neighbors in the feature space. Then, the neighbors' label rankings are aggregated to provide a prediction. Rank aggregation for prediction is not a trivial task, particularly in the presence of partial label ranks. The Mallows [4] and Plackett-Luce [3] models, which describe probability distributions over rankings, have been used to derive the optimization criterion for rank aggregation. As an alternative, the CPS probabilistic model for rank aggregation has been recently proposed [16]. Instance-based label ranking algorithms are simple and intuitive. Furthermore, they have been shown to outperform their competitors in various label ranking scenarios. However, their success comes at a large cost in both memory and time. First, they require that the entire training data set is stored in memory, which can be costly or even impossible in resource-constrained applications. Storing the original data can also raise privacy issues, as the data might contain sensitive user information. Second, the prediction involves a costly nearest-neighbor search and the aggregation of the neighbors' label rankings. The aggregation is slow as it requires using optimization techniques at prediction time, such as iterative Minorization-Maximization (IB-PL) or exhaustive search (IB-Mallows).

In this paper we propose an online, time- and memory-efficient algorithm for learning label preferences based on the Gaussian Mixture Model (GMM), which could be attractive because of an intuitively clear learning process and ease of implementation. The model preserves privacy as it consists of mixtures defined by prototypes which are not the actual data points. Every prototype is associated with preference judgments for each pair of labels. Unlike many competitors, our algorithm is not limited to a specific type of label ranking and can support various ranking structures (bipartite, multipartite, etc.). For an unlabeled instance, the GMM predicts soft label preferences by averaging the prototypes' pairwise preferences according to distances. These soft label preferences, in the form of a preference matrix, need to be further aggregated into a total order of labels. This is a well-known problem in preference learning and an especially popular research area in the object ranking scenario, where numerous methods have been proposed [5, 1, 2]. 2 PRELIMINARIES In the label ranking scenario, a single instance, described by a d-dimensional vector x, is associated with a total order of assigned class labels, represented as a permutation π of the set {1, ..., L}, where L is the number of available class labels. We define π such that π(i) is the class label at the i-th position in the order, and $\pi^{-1}(j)$ is the position of the $y_j$ class label in the order. The permutation can also describe an incomplete ranking $\{\pi(1), \ldots, \pi(k)\} \subset \{1, \ldots, L\}$, k < L. In our approach, instead of the total order, we use a zero-diagonal preference matrix Y. When a preference between labels $y_i$ and $y_j$ exists, we set Y(i, j) > Y(j, i) if $y_i$ is preferred over $y_j$ and Y(i, j) < Y(j, i) otherwise, with Y(i, j) + Y(j, i) = 1, for i, j ∈ {1, ..., L}. A value of Y(i, j) close to 1 is interpreted as a strong preference that $y_i$ should be ranked above $y_j$. A typical approach is to assign Y(i, j) = 1 and Y(j, i) = 0 if $y_i \succ_x y_j$. Similarly, uncertain (soft) preferences can be modeled by using values lower than 1. For example, indifferences (ties) are represented by setting Y(i, j) = Y(j, i) = 0.5. In the case of non-existing, incomparable or missing preferences, both Y(i, j) = 0 and Y(j, i) = 0. This representation allows us to work with complete and partial label orders, as well as with pairwise preferences with uncertainties and indifferences. Finally, bipartite and multipartite label rankings can be handled as well. Evaluation metrics. Let us assume that N historical observations are collected in the form of a data set $D = \{(x_n, Y_n),\ n = 1, \ldots, N\}$. The objective in all scenarios is to train a ranking function $h : x_n \to \hat{\pi}_n$ from the data set D that outputs a total label order. In the label ranking scenario, to measure the degree of correspondence between the true and predicted rankings for the n-th example, $\pi_n$ and $\hat{\pi}_n$ respectively, it is common to use the Kendall tau distance

$$d_n = \big|\{(y_i, y_j) : \pi_n^{-1}(y_i) > \pi_n^{-1}(y_j) \wedge \hat{\pi}_n^{-1}(y_j) > \hat{\pi}_n^{-1}(y_i)\}\big|.$$

To evaluate a label ranking model, the label ranking loss on the data set D is defined as the average normalized Kendall tau distance,

$$loss_{LR} = \frac{1}{N} \sum_{n=1}^{N} \frac{2\, d_n}{L (L-1)}. \qquad (1)$$

Note that this measure simply counts the number of discordant label pairs and reports the average over all considered pairwise rankings.
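A sketch of the preference matrix construction and of loss_LR follows; for brevity, labels are assumed to be 0-indexed integers and the orders complete, and the function names are ours.

```python
import numpy as np
from itertools import combinations

def order_to_matrix(pi, L):
    """Hard preference matrix Y from a (possibly partial) label order pi,
    given best-first. Absent label pairs keep Y(i, j) = Y(j, i) = 0;
    a tie would be encoded as 0.5/0.5."""
    Y = np.zeros((L, L))
    for a, b in combinations(pi, 2):     # a precedes b in pi
        Y[a, b] = 1.0
    return Y

def loss_lr(true_orders, pred_orders, L):
    """Average normalized Kendall tau distance, equation (1)."""
    total = 0.0
    for pi, pi_hat in zip(true_orders, pred_orders):
        pos_hat = {y: i for i, y in enumerate(pi_hat)}
        d_n = sum(1 for a, b in combinations(pi, 2) if pos_hat[b] < pos_hat[a])
        total += 2.0 * d_n / (L * (L - 1))
    return total / len(true_orders)
```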
Given the general preference matrix representation, and assuming binary matrix predictions $\hat{Y}_n$, we can rewrite (1) as

$$loss_P = \frac{1}{N} \sum_{n=1}^{N} \frac{\|Y_n - \hat{Y}_n\|_F^2}{L (L-1)}, \qquad (2)$$

where $\|\cdot\|_F$ is the Frobenius matrix norm. Indeed, for each example n, the square of the Frobenius norm sums up to double the number of discordant label pairs. For models with soft label preference predictions $\hat{Y}_n$, e.g. $\hat{Y}_n(i, j) = 0.7$, $\hat{Y}_n(j, i) = 0.3$, loss (2) can be interpreted as a soft version of (1). We solve the preference learning task in two stages. In the learning stage, a function $f : x_n \to Y_n$ is learned via minimization of (2). In the aggregation stage, given the model predictions in the form of $Y_n$, the total order prediction $\pi_n$ is computed using a preference aggregation mapping $g : Y_n \to \pi_n$. In the next section we show the details of the proposed Gaussian Mixture Model algorithm used in the learning stage. Existing algorithms, such as [5, 1, 2], can be used in the aggregation stage. 3 GAUSSIAN MIXTURE MODEL FOR LABEL RANKING The GMM model for label ranking is completely defined by a set of K mixtures, i.e. prototypes $\{(m_k, Q_k),\ k = 1, \ldots, K\}$, where $m_k$ is a d-dimensional vector in the input space and $Q_k$ is the corresponding preference matrix. First, we introduce the probability P(k|x) of assigning observation x to the k-th prototype, which depends on their (Euclidean) distance. Let us assume that the probability density P(x) of x can be described by a mixture model,

$$P(x) = \sum_{k=1}^{K} P(x|k)\, P(k), \qquad (3)$$

where K is the number of prototypes, P(k) is the prior probability that a data point is generated by the k-th prototype, and P(x|k) is the conditional probability that the k-th prototype generates the particular data point x. Let us represent the conditional density function P(x|k) in the normalized exponential form $P(x|k) = \theta(k) \exp(f(x, m_k))$ and consider a Gaussian mixture with $\theta(k) = (2\pi\sigma_p^2)^{-1/2}$ and $f(x, m_k) = -\|x - m_k\|^2 / 2\sigma_p^2$. We assume that all prototypes have the same standard deviation $\sigma_p$ and the same prior, P(k) = 1/K. Given this, using the Bayes rule we can write the assignment probability as

$$P(k|x) = \frac{\exp(-\|x - m_k\|^2 / 2\sigma_p^2)}{\sum_{u=1}^{K} \exp(-\|x - m_u\|^2 / 2\sigma_p^2)}. \qquad (4)$$

To derive a cost function, we propose the following mixture model for the posterior probability P(Y|x),

$$P(Y|x) = \sum_{k=1}^{K} P(k|x)\, P(Y|k). \qquad (5)$$

Based on this model, example x is assigned to the prototypes probabilistically and its preference matrix is a weighted average of the prototype preference matrices. The mixture model assumes conditional independence between x and Y given k, P(Y|x, k) = P(Y|k). For P(k|x) we assume the Gaussian distribution from (4).

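As a sketch of the prediction step of Eqs. (3)-(5) (a hypothetical implementation of ours, with variable names of our choosing), the soft preference matrix of an unlabeled instance is the assignment-weighted average of the prototype matrices:

import numpy as np

def predict_preference_matrix(x, m, Q, sigma_p):
    """Soft preference prediction for instance x.

    m: (K, d) array of prototype positions; Q: (K, L, L) array of
    prototype preference matrices; returns an (L, L) matrix.
    """
    d2 = np.sum((m - x) ** 2, axis=1)        # squared Euclidean distances
    logits = -d2 / (2.0 * sigma_p ** 2)
    logits -= logits.max()                   # numerical stability
    g = np.exp(logits)
    g /= g.sum()                             # assignment P(k|x), Eq. (4)
    return np.tensordot(g, Q, axes=1)        # weighted average, Eq. (5)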
For the probability of generating a preference matrix Y by prototype k, P(Y|k), we also assume a Gaussian error model with mean (Y − Q_k) and standard deviation σ_y. The resulting cost function l(λ) can be written as the negative log-likelihood,

$$l(\lambda) = -\frac{1}{N}\sum_{n=1}^{N}\ln\sum_{k=1}^{K} P(k|x_n)\,\mathcal{N}(Y_n - Q_k, \sigma_y^2), \qquad (6)$$

where λ = {m_k, Q_k, k = 1, ..., K, σ_p, σ_y} are the model parameters. For compactness of notation, let us define g_nk = P(k|x_n) and e_nk = N(Y_n − Q_k, σ_y²). It is important to observe that, after proper normalization, (6) reduces to (2) if examples are assigned to prototypes deterministically; it can therefore be interpreted as a soft version of (2). If the prototype matrices Q_k consisted of only 0 and 1 entries (hard label preferences), (6) would further reduce to (1).

The objective is to estimate the unknown model parameters, namely the prototype positions m_k and their preference matrices Q_k, k = 1, ..., K. This is done by minimizing the cost function l(λ) with respect to the parameters, which can be achieved in several different ways. If online learning capability is a requirement, one can use the stochastic gradient descent method and obtain the learning rules by calculating the derivatives ∂l(λ)/∂m_k and ∂l(λ)/∂Q_k for k = 1, ..., K. This results in the following rules for the n-th training example,

$$m_k^{n+1} = m_k^n - \alpha(n)\,\frac{(L_n - e_{nk})\,g_{nk}}{L_n\,\sigma_p^2}\,(x_n - m_k), \qquad Q_k^{n+1} = Q_k^n + \alpha(n)\,\frac{e_{nk}\,g_{nk}}{L_n\,\sigma_y^2}\,(Y_n - Q_k), \qquad (7)$$

where L_n = Σ_{k=1}^{K} g_nk e_nk and α(n) is the learning rate. The resulting model has complexity O(NKL). Otherwise, the problem can straightforwardly be mapped into the Expectation-Maximization (EM) framework, following the procedure from [17]. Initialization is done by selecting the first K training points as the initial prototypes. If any prototype preference matrix Q_k obtained in this manner contains empty elements, they are replaced with 0.5 entries, as the corresponding labels will initially be treated equally.
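A minimal sketch of the stochastic gradient step of Eq. (7) follows (ours; it matches our reconstruction of the garbled update rules, with the constant normalization factor of the Gaussian error absorbed into the learning rate):

import numpy as np

def sgd_step(x, Y, m, Q, sigma_p, sigma_y, alpha):
    """One online update on example (x, Y), modifying m and Q in place.

    m: (K, d) prototypes; Q: (K, L, L) preference matrices.
    """
    d2 = np.sum((m - x) ** 2, axis=1)
    g = np.exp(-d2 / (2 * sigma_p ** 2))
    g /= g.sum()                                   # g_nk = P(k|x_n)
    e = np.exp(-np.sum((Y - Q) ** 2, axis=(1, 2)) / (2 * sigma_y ** 2))
    Ln = g @ e                                     # L_n = sum_k g_nk * e_nk
    coef_m = (Ln - e) * g / (Ln * sigma_p ** 2)
    coef_Q = e * g / (Ln * sigma_y ** 2)
    m -= alpha * coef_m[:, None] * (x - m)         # first rule of Eq. (7)
    Q += alpha * coef_Q[:, None, None] * (Y - Q)   # second rule of Eq. (7)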
The GMM model generalizes the training data to produce a representation in terms of prototype vectors, and effectively uses distances to prototypes as a similarity measure to calculate the predicted label rank. Compared to IB-based algorithms, GMM is a more global model that aggregates over more data, thus also alleviating the influence of noise. It is therefore expected to outperform IB-based algorithms, whose performance is highly dependent on the quality of the training data and the presence of outliers, since no abstraction is made during their training phase. We could try to aggregate over more data by considering a large number of neighbors in IB algorithms; however, by doing so we start ignoring the distances between the query instance and its neighbors as a similarity measure. This remains to be seen after a proper experimental evaluation. A disadvantage of the GMM model is that it requires aggregation of the predicted label preference matrix to produce a total order of labels. Luckily, most of the existing algorithms have low complexity, e.g., O(L log L) for QuickSort [1]. In our future work we plan to evaluate the pros and cons of different aggregation methods.

4 CONCLUSION AND FUTURE WORK

We introduced the idea of a Gaussian Mixture Model algorithm for label ranking. The main advantages of the new method are: (1) it is capable of operating in an online manner, (2) it is memory-efficient since it operates on a predefined budget, (3) it preserves privacy, and (4) it could potentially reuse the model when new labels are introduced.

There are several avenues that need to be pursued further: (1) experimental evaluation of the proposed method on benchmark data, (2) determining the optimal number of prototypes K using statistical learning theory, (3) low-rank approximation of the pairwise preference matrices to reduce memory requirements, (4) evaluating different preference matrix aggregation algorithms, and (5) applying the algorithm to clustering of label ranking data. In future work we plan to address these issues.

REFERENCES
[1] Nir Ailon and Mehryar Mohri, An efficient reduction of ranking to classification, in COLT, Omnipress, (2008).
[2] Maria-Florina Balcan, Nikhil Bansal, Alina Beygelzimer, Don Coppersmith, John Langford, and Gregory B. Sorkin, Robust reductions from ranking to classification, in COLT, volume 4539 of Lecture Notes in Computer Science, (2007).
[3] Weiwei Cheng, Krzysztof Dembczyński, and Eyke Hüllermeier, Label ranking methods based on the Plackett-Luce model, in Proceedings of the 27th International Conference on Machine Learning (ICML-10), (2010).
[4] Weiwei Cheng, Jens Hühn, and Eyke Hüllermeier, Decision tree and instance-based learning for label ranking, in Proceedings of the 26th International Conference on Machine Learning (ICML-09), (2009).
[5] William W. Cohen, Robert E. Schapire, and Yoram Singer, Learning to order things, Journal of Artificial Intelligence Research, 10, 243–270, (1999).
[6] Ofer Dekel, Christopher Manning, and Yoram Singer, Log-linear models for label ranking, in Advances in Neural Information Processing Systems, volume 16, MIT Press, (2003).
[7] Yoav Freund, Raj D. Iyer, Robert E. Schapire, and Yoram Singer, An efficient boosting algorithm for combining preferences, Journal of Machine Learning Research, 4, 933–969, (2003).
[8] Thomas Gärtner and Shankar Vembu, Label ranking algorithms: A survey, in Preference Learning, eds., Johannes Fürnkranz and Eyke Hüllermeier, Springer Verlag, (2010).
[9] Sariel Har-Peled, Dan Roth, and Dav Zimak, Constraint classification for multiclass classification and ranking, in Proceedings of the 16th Annual Conference on Neural Information Processing Systems, NIPS-02, MIT Press, (2003).
[10] R. Herbrich, T. Graepel, and K. Obermayer, Large margin rank boundaries for ordinal regression, in Advances in Large Margin Classifiers, MIT Press, (2000).
[11] Eyke Hüllermeier, Johannes Fürnkranz, Weiwei Cheng, and Klaus Brinker, Label ranking by learning pairwise preferences, Artificial Intelligence, 172, 1897–1916, (2008).
[12] Toshihiro Kamishima and Shotaro Akaho, Efficient clustering for orders, in ICDM Workshops, IEEE Computer Society, (2006).
[13] C. L. Mallows, Non-null ranking models, Biometrika, 44, 114–130, (1957).
[14] Mihajlo Grbovic, Nemanja Djuric, and Slobodan Vucetic, Supervised clustering of label ranking data, SIAM International Conference on Data Mining, (2012).
[15] R. L. Plackett, The analysis of permutations, Applied Statistics, 24(2), 193–202, (1975).
[16] Tao Qin, Xiubo Geng, and Tie-Yan Liu, A new probabilistic model for rank aggregation, in Advances in Neural Information Processing Systems 23, MIT Press, (2010).
[17] A. S. Weigend, M. Mangeas, and A. N. Srivastava, Nonlinear gated experts for time series: Discovering regimes and avoiding overfitting, International Journal of Neural Systems, 6, 373–399, (1995).
[18] Ricardo Vilalta and Youssef Drissi, A perspective view and survey of meta-learning, Artificial Intelligence Review, 18, 77–95, (2002).

Preference Function Learning over Numeric and Multi-valued Categorical Attributes

Lucas Marin 1, Antonio Moreno and David Isern

Abstract. One of the most challenging goals of recommender systems is to infer the preferences of users through the observation of their actions. Those preferences are essential to obtain a satisfactory accuracy in the recommendations. Preference learning is especially difficult when attributes of different kinds (numeric or linguistic) intervene in the problem, and even more so when they take multiple possible values. This paper presents an approach to learn user preferences over numeric and multi-valued linguistic attributes through the analysis of the user's selections. The learning algorithm has been tested with real data on restaurants, showing a very good performance.

1 Introduction

Nowadays it is practically inconceivable to select our summer holiday destination, or to choose which film to see in the cinema this weekend, without consulting specialized sources of information in which, in some way or another, our preferences can be specified to help the system recommend us the best choices. That is because we live in an era in which so much data is easily available that it is impossible to manually filter every piece of information and evaluate it accurately. Recommender Systems (RS) have been designed to do this time-consuming task for us and, by feeding them with information about our interests, they are capable of telling us the best alternatives in a personalized way.

A RS stores the preferences of the user about the values of some criteria and uses this information to rate and sort a corpus of alternatives. The management of the preferences, the accuracy of the recommendations, and how these interests evolve over time are three of the most challenging tasks of this type of system [8]. Concerning the first goal, a RS may obtain feedback from a user implicitly, explicitly, or by combining both approaches. This paper discusses an unsupervised way to infer the user's interests, which observes the user's interactions and does not require any explicit information from him [4]. The criteria used to describe the alternatives may have different natures. Some works propose the use of ontologies to represent concepts within a hierarchy [2,9]. Other researchers use fuzzy logic techniques to deal with linguistic criteria [1,10], and many approaches only consider numerical criteria [5]. This paper considers the use of linguistic and numerical data, which permits a high degree of expressivity and can be applied in a wide range of domains. The basic idea is to use the preferences to sort a set of alternatives, show this ordered list to the user, and observe his final selection. With this information, the preference learning algorithm is able to modify the user profile so that it captures the user's preferences better and the next recommendation is more accurate.

The rest of the paper is organized as follows. Section 2 includes a brief explanation of the related work the authors conducted in the area of preference learning over linguistic and numeric attributes, explaining how the interests of users over certain attributes or criteria are managed and learned. Section 3 explains a new approach to manage categorical attributes when they can take multiple linguistic values in a single alternative. Section 4 describes how a more expressive function defining the behaviour of the preference over numeric attributes can be automatically learned.
In Section 5 the case study in which our approach has been tested (restaurant recommendation) is explained, describing the data set used and the results obtained. Finally, Section 6 gives the main conclusions of the paper and identifies some lines of future research.

2 Preference learning over categorical and numerical attributes

When we face a decision problem in which we require the aid of a RS to help us make a choice, all of the possible alternatives to said problem are defined, in most cases, by the same attributes. In this work we focus only on categorical and numeric attributes. The following subsections explain how preferences over the two kinds of attributes are expressed, how alternatives are evaluated and ranked, and how the user's interests are learned and adapted from his selections.

2.1 Attributes and management of preferences

In a recent work ([7]) we proposed to represent the level of interest over categorical attributes by using a linguistic scale in which preference labels are defined as fuzzy sets representing values of preference such as Very Low, Low, Medium, High or Very High (see Figure 1).

1 Department of Computer Science and Mathematics, Universitat Rovira i Virgili, Tarragona, Catalonia (Spain). E-mail addresses: lucas.marin@urv.cat, antonio.moreno@urv.cat, david.isern@urv.cat.

Figure 1. Example of a linguistic preference set

For the case of numeric attributes, we assumed that each user has a preference function for each attribute. This function has a triangular shape (see Figure 2) and is defined as

$$p_a(x) = 1 - \frac{|x - v_{pref}|}{\delta}, \qquad (1)$$

where p_a(x) is the preference of the value x of the attribute a, v_pref is the value of maximum preference, and δ is the width of the function, which we considered to be 10% of the attribute domain.

Figure 2. Basic numeric preference function

2.2 Alternatives evaluation

When evaluating an alternative, the objective is to aggregate all of the values of all of the attributes into a single value. Since we have two kinds of attributes, a conversion to the same domain is made. In our approach, we chose to translate the numerical preferences to linguistic ones. The translation is done by, first, calculating the value of preference of a certain numeric attribute value using Eq. (1). Then that value is mapped to the fuzzy linguistic labels domain and matched with the label with the highest value at that point. When all the attributes have been assigned a value of preference using the same fuzzy linguistic scale, all the terms are aggregated using the ULOWA aggregation operator [3]. The final result of this aggregation is the value of preference assigned to the whole alternative, which is used to rank the alternatives.
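As an illustration of this evaluation step (a sketch of ours, not the paper's code: the preference is truncated at 0 outside the triangle, the label centroids are assumed values, and a crisp nearest-label mapping stands in for the fuzzy matching; the ULOWA aggregation itself is not shown):

def triangular_preference(x, v_pref, delta):
    """Numeric preference of Eq. (1): 1.0 at the preferred value v_pref,
    decreasing linearly to 0 at distance delta (10% of the domain)."""
    return max(0.0, 1.0 - abs(x - v_pref) / delta)

# Assumed representative points for the five-label scale of Figure 1.
LABELS = {"Very Low": 0.0, "Low": 0.25, "Medium": 0.5,
          "High": 0.75, "Very High": 1.0}

def to_linguistic(p):
    """Map a numeric preference to the closest label of the scale."""
    return min(LABELS, key=lambda lab: abs(LABELS[lab] - p))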
2.3 Preference learning

When the ranked alternatives are presented to the user, two things can happen: (a) the user selects the first-ranked alternative, or (b) the user selects any other alternative. The first case means that the recommendation process has worked accurately, since the system gave the first place to the selected alternative. However, in the second case, there were other alternatives (which we call over-ranked) that were considered by the system as better than the one the user finally selected. That probably indicates that the information we have in the user profile is not accurate enough and should be modified. In a nutshell, the main intuition behind the user profile change algorithm is that we should increase the preference on the attribute values present in the selected alternative and decrease the preference on the attribute values appearing in the over-ranked alternatives. The information required to infer this reasoning is extracted from what is called relevance feedback; in this case, it consists of the over-ranked alternatives and the selected one. Numerical and categorical attributes are managed in different ways, as described in the following subsections.

2.3.1 Linguistic preference adaptation

The main idea is to find attribute values repeated among the over-ranked alternatives that do not appear in the selection, which will be the candidates for having their preference decreased. Similarly, the preference of the attribute values that appear in the selection and do not appear often in the over-ranked alternatives is likely to be increased. The interested reader may find a more detailed explanation of the process of adaptation of linguistic preferences in [7]. The profile adaptation is conducted by two processes. The first one, called on-line adaptation, is executed every time the user asks the system for a recommendation, and it evaluates the information that can be extracted from the current ranked set of alternatives.

The main goals of this stage are to decrease the preference of the attribute values that are causing non-desired alternatives to be given high scores, and to increase the preference of the attribute values that are important for the user but are not well judged on the basis of the current user profile. For each recommendation made by the system, two sources of information are evaluated: the selected alternative, which is the choice made by the user, and the alternatives that were ranked above it. Values extracted from the over-ranked alternatives have their level of preference decreased, whereas the ones extracted from the user's final selection that do not appear in the set of over-ranked alternatives have their preference increased.

The second process, called off-line adaptation, is triggered after the recommender system has been used a certain number of times. It considers the information given by the history of the previous rankings of alternatives and the selections made by the user in each case, but it considers that information separately. When the system faces cases in which the number of over-ranked alternatives is not large enough for reliable characteristics to be extracted, it stores the small number of over-ranked alternatives in a temporary buffer. After several iterations in which the number of over-ranked alternatives has been insufficient for evaluation, the system will have recorded enough alternatives to start evaluating them. When there are enough saved over-ranked alternatives, the values in their attributes are analysed and their preference decreased. Moreover, user selections are also stored and, after a certain number of choices have been made, they are evaluated with the objective of increasing the preference of the most repeated attribute values, since their repeated selection indicates that the user is really interested in them.

2.3.2 Numeric preference learning

The numeric adaptation of the user profile presented in [6] is inspired by Coulomb's law: the magnitude of the electrostatic force of interaction between two point charges is directly proportional to the product of the magnitudes of the charges and inversely proportional to the square of the distance between them. The main idea is to consider the value stored in the profile (the current preference) as a charge with the same polarity as the values of the same criterion in the over-ranked alternatives, and with opposite polarity to the value of that criterion in the selected alternative. Thus, the value in the profile is pushed away by the values in the over-ranked alternatives and pulled towards the value in the selected alternative. Two stages have been considered in the adaptation algorithm. The first one, called the on-line adaptation process, is performed each time the user asks for a recommendation. The other stage, called the off-line process, is performed after a certain number of interactions with the user.

Figure 3. Attraction and repulsion forces

For the on-line stage, the information available in each iteration is the user's selection and the set of over-ranked alternatives. In order to calculate the change of the value of preference in the user profile for each criterion, it is necessary to study the attraction force exerted by the selected alternative and the repulsion forces exerted by the over-ranked ones in each criterion, as represented in the example in Figure 3, in which the j-th value of the five over-ranked alternatives o⁰, o¹, o², o³ and o⁴ causes a repulsion force F^o_j, and the value for the same criterion of the selected alternative, s_j, causes an attraction force F^s_j. Both forces are applied on the j-th value of the profile, P_j. The attraction force F^s exerted by the selected alternative for each criterion j is defined as

$$F^s(P, s, j, \alpha) = \begin{cases} \alpha \, \dfrac{1}{\left(\frac{s_j - P_j}{\delta_j}\right)^2} \cdot \dfrac{s_j - P_j}{|s_j - P_j|} & \text{if } s_j \neq P_j \\ 0 & \text{if } s_j = P_j \end{cases} \qquad (2)$$

In this equation, δ_j is the range of the criterion j, s_j is the value of the criterion j in the selected alternative, and P_j is the value of the same criterion in the stored profile P. The parameter α adjusts the strength of the force in order to have a balanced adaptation process. The repulsion force exerted by the over-ranked alternatives for each criterion j is defined as a generalization of Eq. (2) as follows:

$$F^o(P, o^1, \ldots, o^{no}, j, \alpha) = \alpha \, \frac{1}{no} \sum_{i=1}^{no} \frac{1}{\left(\frac{P_j - o_j^i}{\delta_j}\right)^2} \cdot \frac{P_j - o_j^i}{|P_j - o_j^i|} \qquad (3)$$

Finally, both forces are summed and the resulting force is calculated.
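The following sketch mirrors our reconstruction of Eqs. (2)-(3) above (the inverse-square profile follows the Coulomb analogy; the helper names are ours):

def attraction_force(P_j, s_j, delta_j, alpha):
    """Attraction of the selected value s_j on profile value P_j, Eq. (2):
    inverse-square in the range-normalized distance, directed towards s_j."""
    if s_j == P_j:
        return 0.0
    d = (s_j - P_j) / delta_j
    sign = 1.0 if s_j > P_j else -1.0
    return alpha * sign / d ** 2

def repulsion_force(P_j, over_ranked, delta_j, alpha):
    """Mean repulsion of the over-ranked values on P_j, Eq. (3): same
    magnitude as the attraction, but directed away from each o_j."""
    return sum(-attraction_force(P_j, o_j, delta_j, alpha)
               for o_j in over_ranked) / len(over_ranked)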
The techniques designed for the on-line stage fail at detecting user trends over time, since they only have information about a single selection. The off-line adaptation process gathers information from several user interactions. This technique allows considering changes in the profile that have a higher reliability than those proposed by the on-line adaptation process, because they are supported by a larger set of data. The off-line adaptation process can be triggered in two ways: the first one evaluates the user's choices, while the second one analyses the over-ranked alternatives discarded by the user in several iterations. The possibility of running the off-line process (in either of its two forms) is checked after each recommendation.

In the first case, the system has collected some alternatives selected by the user in several recommendation steps, and it calculates the attraction forces (F^s) exerted by each of the stored selected alternatives over the values stored in the profile, using an adaptation of Eq. (2) that has as inputs the profile P, the past selections {s¹, ..., s^rs}, the criterion to evaluate j, and the strength-adjusting parameter α:

$$F^{s'}(P, s^1, \ldots, s^{rs}, j, \alpha) = \alpha \, \frac{1}{rs} \sum_{i=1}^{rs} \frac{1}{\left(\frac{s_j^i - P_j}{\delta_j}\right)^2} \cdot \frac{s_j^i - P_j}{|s_j^i - P_j|} \qquad (4)$$

The second kind of off-line adaptation process evaluates the set of over-ranked alternatives that have been collected through several iterations and which were not used in the on-line adaptation process (because it did not have enough over-ranked alternatives in a single iteration). When the stored over-ranked alternatives reach a certain number, the off-line adaptation process calculates the repulsion forces over the profile values exerted by those alternatives (F^o), which are calculated using Eq. (3).

3 Multi-valued linguistic attributes

As explained in Section 2, in our previous work we considered categorical attributes that could take only one linguistic value (for example, a city could have a single value in the Climate attribute). However, there are cases in which it is interesting to consider multiple values. One example of that situation could be the attribute Type of food in a restaurant: a restaurant can have the values {Asian, Seafood, Vegetarian} while another can have only Italian food. Extending our model so that it can manage lists of categorical values implies addressing two issues: how to represent and calculate the user preferences over the attribute taking into account all of the values, and how to adapt those preferences dynamically.

3.1 Preference value on multi-valued attributes

When there is an attribute with multiple values, a procedure should be defined to decide which single linguistic preference best represents the whole set of linguistic values. Going back to the restaurant example, if a user has a High preference over Asian food restaurants and a Low preference over Rice dishes, we can argue that the preference we could assign to the Type of food attribute in a restaurant with both values should be Medium (an average of the two kinds). If another restaurant only offers Asian food, then its preference should be High, so this restaurant would have a higher ranking than the first one. The rationale of this procedure is that it seems more adequate to reward the alternatives that are more focused on the aspects the user really likes. This example represents an average preference aggregation policy; however, other policies can also be considered depending on the attribute definition.
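The average aggregation policy can be sketched as follows (a toy illustration of ours; numeric preferences on a 0-1 scale stand in for the linguistic labels):

def attribute_preference(values, prefs, policy="average"):
    """Aggregate the per-value preferences of a multi-valued attribute."""
    scores = [prefs[v] for v in values]
    if policy == "average":
        return sum(scores) / len(scores)
    raise ValueError("unknown aggregation policy: " + policy)

# Running example: High (0.8) Asian food plus Low (0.2) Rice dishes
# averages to a Medium-ish preference, while a purely Asian restaurant
# keeps its High preference and is ranked above the first one.
prefs = {"Asian": 0.8, "Rice dishes": 0.2}
print(attribute_preference(["Asian", "Rice dishes"], prefs))  # 0.5
print(attribute_preference(["Asian"], prefs))                 # 0.8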

3.2 Preference learning on multi-valued categorical attributes

The linguistic algorithm used to adapt categorical preferences, explained in Section 2, needs some improvements to be able to manage lists of values. When single-valued attributes were considered, the user's selection pointed directly towards the value the user liked for that attribute. Now, however, we cannot be sure which one(s) of the values listed in the attribute is/are the one(s) of interest for the user. That is the reason why it has been necessary to design a relevance function which indicates how relevant a value found among the over-ranked alternatives, or in the selected alternative, is. Relevance is measured on a [0,1] scale, with 1 meaning maximum relevance. To calculate how relevant a term t of the attribute j is among the over-ranked alternatives we use this expression (the relevance value is 0 if the term does not appear in the over-ranked alternatives):

$$R_j^o(t) = \frac{1}{no} \sum_{i=1}^{nt} \frac{1}{nv_j^i} \qquad (5)$$

Here, no represents the number of over-ranked alternatives, nt the number of over-ranked alternatives where t appears, and nv_j^i the number of values that appear for the attribute j in the alternative i. In this equation we consider that every linguistic term that appears in the over-ranked alternatives has a relevance which is inversely proportional to the number of other values for the same attribute that appear among the entire set of over-ranked alternatives. To calculate the relevance of a term in the selection we use:

$$R_j^s(t) = \frac{1}{2}\left(\frac{1}{nv_j} + \frac{nl}{tv}\right) \qquad (6)$$

Here nv_j represents the number of values that appear for the attribute j in the selection, nl the total number of linguistic attributes, and tv the total number of linguistic values that appear in the selection. The relevance of a term in the selection is the mean between the importance of the term among the values that appear with it in the same attribute and the importance of each linguistic term that appears in the selection compared with the number of linguistic attributes. Finally, after calculating both partial relevancies for all the terms, the overall relevance R_j(t) is calculated as:

$$R_j(t) = R_j^s(t) - R_j^o(t) \qquad (7)$$

In conclusion, considering a threshold γ to avoid making changes in the profile with low relevance, it can be deduced that:
- If R_j(t) > γ, the preference over term t for the attribute j needs to be increased (moved to the next term).
- If R_j(t) < −γ, the preference over term t for the attribute j needs to be decreased (moved to the previous term).
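A minimal sketch of this relevance computation follows (ours; the dictionary-based data layout is an assumption):

def term_relevance(t, j, over_ranked, selection, nl):
    """Overall relevance R_j(t) of term t for attribute j, Eq. (7).

    over_ranked: list of alternatives, each a dict attribute -> list of
    linguistic terms; selection: the selected alternative (same format);
    nl: total number of linguistic attributes.
    """
    no = len(over_ranked)
    # Eq. (5): relevance among the over-ranked alternatives
    r_o = sum(1.0 / len(alt[j]) for alt in over_ranked if t in alt[j]) / no
    # Eq. (6): relevance within the selection (0 if t is absent)
    if t in selection[j]:
        tv = sum(len(terms) for terms in selection.values())
        r_s = 0.5 * (1.0 / len(selection[j]) + nl / tv)
    else:
        r_s = 0.0
    return r_s - r_o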
4 Learning preference functions for numeric attributes

Although the numeric preference learning approach described in Section 2 provided an adequate way of learning the ideal value of preference over a numeric attribute, it was unable to learn the other parameters that model the preference function, such as the slope or the width, which were fixed. The new learning method presented in this section relies on historical data about the user's selections to approximate the preference function of the numeric attributes as closely as possible.

With this approach we have a new definition of the preference function, which now has 5 parameters (left and right slope, left and right width, and value of preference) instead of just the value of preference:

$$p_a(x) = \begin{cases} \left(1 - \dfrac{|x - v_{pref}|}{\delta_l}\right)^{m_l} & \text{if } x < v_{pref} \\ 1 & \text{if } x = v_{pref} \\ \left(1 - \dfrac{|x - v_{pref}|}{\delta_r}\right)^{m_r} & \text{if } x > v_{pref} \end{cases} \qquad (8)$$

In this expression p_a(x) is the preference of the value x of the attribute a, m_l and m_r are the function slope values (for the left and right sides of the triangle, respectively) and δ_l and δ_r are the parameters which define the width of the function (also for the left and right sides of the triangle, respectively). An example of the graphical representation of a preference function can be seen in Figure 4, where the left slope is a value under 1, the right slope is a value over 1, and the left width is greater than the right one.

Figure 4. Numeric preference function with 5 parameters

The whole process of adapting the numeric preference function is depicted in Figure 5.

function PREF-FUNC-ADAPTATION(
    V(v_0, ..., v_n),  // history of values of past selections
    v_pref,            // value of maximum preference
    v_min,             // minimum numeric value
    v_max,             // maximum numeric value
    ti,                // trust interval
    s)                 // probability distribution sampling step
begin
    B = getBestValues(V, v_pref, ti);
    PD = calculateProbabilityDistribution(B, v_min, v_max, s);
    delta{left,right} = calculateDelta(PD);
    m{left,right} = calculateBestSlope(PD, v_pref, delta);
    PreferenceFunction = (delta, m, v_pref);
    return PreferenceFunction;
end;

Figure 5. Preference function learning algorithm
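The five-parameter function of Eq. (8), as reconstructed above, can be sketched as follows (ours; under this reading, slope values under 1 bow a side of the triangle outwards and values over 1 bow it inwards):

def preference_5p(x, v_pref, delta_l, delta_r, m_l, m_r):
    """Asymmetric preference function with 5 parameters, after Eq. (8)."""
    if x == v_pref:
        return 1.0
    delta, m = (delta_l, m_l) if x < v_pref else (delta_r, m_r)
    base = 1.0 - abs(x - v_pref) / delta
    return base ** m if base > 0 else 0.0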

The first step consists in obtaining the most reliable values from the historical set of selections. This is done by extracting a percentage of the values closer to the value of preference (the trust interval), normally 90%. With that we avoid considering outlier values. Then a probability distribution function, represented with a histogram, is calculated with those best values. The sampling or discretization step is a parameter, normally around 1% of the domain range. The delta values are then calculated by observing the width of the probability distribution. For example, if the first value different from 0 in the histogram is 3 and the last is 56, and the value of highest preference (v_pref) is 34, δ_l would be 31 and δ_r would be 22. Afterwards, the algorithm generates preference functions with different combinations of values for the slopes m (in the range from 0 to 4, in steps of 0.2), and compares the distance between each preference function and the probability distribution. The function with the lowest distance determines the chosen slope. Finally, the new preference function is built with the new delta and slope values.

5 Case study: restaurant recommendation

In order to test our new approach to multi-valued attribute evaluation and numeric preference function learning, we have used data on the restaurants of Barcelona to implement a RS with the ability to learn the users' interests from their selections. In the first part of this section a description of the data is given. Then, a basic explanation of the whole recommendation and learning algorithm is given, as well as the preferences setup. Finally, the results of the evaluation are provided.

5.1 Barcelona restaurants data

The data used in this problem have been collected from the BcnRestaurantes web page (last access May 30th). The data set contains information about 3000 restaurants of Barcelona described by 5 attributes: 3 categorical (Type of food, 15 values; Atmosphere, 14 values; Special characteristics, 12 values) and 2 numerical (Average price, Distance to city center). One example of a register in the data file is "Fonda España; National, Season cuisine, Traditional; Classic, For families; Round tables, In a hotel, With video; 45; 0.979", where Fonda España is the restaurant name; National, Season cuisine and Traditional are the types of food served; Classic and For families describe the restaurant atmosphere; Round tables, In a hotel and With video are other important restaurant characteristics; 45 is the average menu price; and 0.979 km is the distance to the city centre.

5.2 Recommendation and adaptation

The set of 3000 restaurants has been divided into blocks of 15 alternatives that are ranked independently, which gives a total of 200 different recommendations. An ideal profile was manually defined and three initial profiles were created randomly. The goal is to learn the ideal profile starting from these three different points. In this evaluation the preferences over the categorical attributes are represented with a linguistic term set of 7 labels: Very Low, Low, Almost Low, Medium, Almost High, High and Very High. The whole process (for each of the three profiles, repeated 200 times) consists in:
1. Ranking a set of 15 alternatives according to the current (initially random) profile.
2. Simulating the selection of the user by choosing the alternative that best fits the ideal profile.
3. Extracting relevance feedback from the selection (the over-ranked alternatives and the selection itself).
4. Deciding which changes need to be made to the current profile and applying them.

Some information about the whole process is stored after each iteration, including the position of the selected alternative, the distance between the ideal and current profiles, and the preferences over linguistic and numeric values.
5.3 Results evaluation

In order to evaluate the results of the new learning techniques, a distance function has been defined to calculate how different the profile being learned is from an ideal profile which represents the exact preferences of the user. The first step is to calculate the distance for each attribute, taking into account whether it is numeric or categorical. The distance between numeric attributes is calculated as

$$d(n, c, i) = 1 - p_n^c\!\left(v_{pref_n}^i\right) \qquad (9)$$

where n is the numerical attribute, c is the current profile (the one being learned), i is the ideal profile, and p_n^c(v_pref_n^i) is the value of preference of the v_pref value for the attribute n in i, computed using the preference function of the same attribute in the profile c. A distance of 0 means that the v_pref values in both profiles are equal. The equation to calculate the distance between categorical attributes is

$$d(l, c, i) = \frac{1}{card(l)} \sum_{k=1}^{card(l)} \frac{\left|CoG(p_l^c(v_k)) - CoG(p_l^i(v_k))\right|}{\left|CoG(s_{max}) - CoG(s_{min})\right|} \qquad (10)$$

where l is the categorical attribute, card(l) is the cardinality of the attribute l (i.e., the number of different linguistic values it can take), CoG(p_l^c(v_k)) and CoG(p_l^i(v_k)) are the x-coordinates of the centres of gravity of the fuzzy linguistic labels associated to the value of preference of v_k in the profiles c and i, respectively, and CoG(s_min) and CoG(s_max) are the centres of gravity of the minimum and maximum labels of the domain, respectively. Finally, the distance between two profiles is calculated as

$$D(c, i) = \frac{1}{na} \sum_{k=1}^{na} d(k, c, i) \qquad (11)$$

where na is the total number of attributes.

During the three tests (one for each initial random profile) the distance between the adapting and the ideal profile has been calculated in each iteration. Figure 6 (continuous line) shows the average of the three distances. The average distance between the ideal and the adapting profiles decreases steadily from its initial value and, after 200 iterations, reaches a distance of around 0.1. Although 200 iterations may seem a large number, it can also be observed that with only 50 iterations a very acceptable result of 0.2 is obtained. To see to what extent the new approach to learn the numeric preference function explained in Section 4 has improved on our previous work (commented in Section 2), Figure 6 also compares the results with and without (dashed line) that functionality. It can be seen that the improvement is noticeable (a distance improvement of about 0.07).
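The evaluation metric of Eqs. (9)-(11) can be sketched as follows (helper signatures are ours; the centres of gravity are assumed precomputed):

def numeric_distance(p_c, v_pref_i):
    """Eq. (9): 1 minus the preference that the current profile's
    function assigns to the ideal profile's preferred value."""
    return 1.0 - p_c(v_pref_i)

def categorical_distance(cog_c, cog_i, cog_min, cog_max):
    """Eq. (10): average normalized gap between the centres of gravity
    of the preference labels in the current and ideal profiles."""
    span = abs(cog_max - cog_min)
    return sum(abs(c - i) for c, i in zip(cog_c, cog_i)) / (len(cog_c) * span)

def profile_distance(attribute_distances):
    """Eq. (11): plain average of the per-attribute distances."""
    return sum(attribute_distances) / len(attribute_distances)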

Figure 6. Average distance between current and ideal profile

To wrap up the results evaluation, Figure 7 shows in what position the user's selection was ranked by the RS in each of the iterations of the first test (the three tests give similar results). This figure shows the results in a more intuitive way. Notice that the system is accurate if the selected alternative is in the first positions of the 15-item list in each iteration. Many factors can interfere in the process and make learning the exact ideal profile a very hard task, but if the user's selection appears in the first positions, we can consider that the learning process is working properly. As can be observed in Figure 7, after about 50 iterations the selected alternative is among the first three in 95% of the cases (and the first one in around 70% of the cases).

Figure 7. Position of the selected alternative in each iteration (test 1)

6 Conclusions and future work

Two main contributions with respect to our previous work have been presented in this paper. The first one consists in managing multi-valued categorical attributes in the alternatives of a RS, allowing more expressivity in their representation. The system considers a single preference for each possible value and aggregates them to find out the preference over the whole attribute. The consideration of multi-valued attributes is mandatory when working with alternatives such as the ones presented in this paper (e.g. the Type(s) of food of a restaurant alternative).

The second contribution, learning the numeric preference function, allows shaping a more expressive and personalised representation of the user's preferences over each numeric attribute, defining a preference function with 5 parameters. This additional expressivity helped to improve the profile learning process by reducing the learning error by around 7%.

As future work, two interesting lines can be considered. As pointed out in Section 3, an aggregation policy other than the common average policy can be used when aggregating the preferences of a single attribute; research can be done in this area in order to learn the aggregation policy that best fits the user's interests. Another interesting line is to incorporate information about the numeric preference function into the distance measure used to evaluate the algorithm since, currently, just the value of preference is being considered.

ACKNOWLEDGEMENTS

This work has been supported by the Universitat Rovira i Virgili (a pre-doctoral grant of L. Marin), the Spanish Ministry of Science and Innovation (DAMASK project, Data mining algorithms with semantic knowledge, TIN) and the Spanish Government (Plan E, Spanish Economy and Employment Stimulation Plan).

REFERENCES
[1] G. Castellano, C. Castiello, D. Dell'Agnello, A. M. Fanelli, C. Mencar, M. A. Torsello, Learning fuzzy user profiles for resource recommendation, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 18 (4) (2010).
[2] I. Garcia, L. Sebastia, E. Onaindia, On the design of individual and group recommender systems for tourism, Expert Systems with Applications 38 (6) (2011).
[3] D. Isern, L. Marin, A. Valls, A. Moreno, The Unbalanced Linguistic Ordered Weighted Averaging operator, in: IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2010, IEEE Computer Society, Barcelona, Catalonia, 2010.
[4] G. Jawaheer, M. Szomszor, P.
Kostkova, Comparison of implicit and explicit feedback from an online music recommendation service, in: 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems, ACM Press, Chicago, US, 2010.
[5] T. Joachims, F. Radlinski, Search engines that learn from implicit feedback, Computer 40 (8) (2007).
[6] L. Marin, D. Isern, A. Moreno, Dynamic adaptation of numerical attributes in a user profile, International Journal of Innovative Computing, Information and Control (2012).
[7] L. Marin, D. Isern, A. Moreno, A. Valls, On-line dynamic adaptation of fuzzy preferences, Information Sciences, doi:10.1016/j.ins (2012).
[8] M. Montaner, B. López, J. L. de la Rosa, A taxonomy of recommender agents on the internet, Artificial Intelligence Review 19 (4) (2003).
[9] A. Moreno, A. Valls, D. Isern, L. Marin, J. Borràs, SigTur/E-Destination: Ontology-based personalized recommendation of tourism and leisure activities, Engineering Applications of Artificial Intelligence, doi:10.1016/j.engappai (2012).
[10] C. Porcel, A. G. López-Herrera, E. Herrera-Viedma, A recommender system for research resources based on fuzzy linguistic modeling, Expert Systems with Applications 36 (3, Part 1) (2009).

Direct Value Learning: a Preference-based Approach to Reinforcement Learning

David Meunier (1), Yutaka Deguchi (2), Riad Akrour (1), Einoshin Suzuki (2), Marc Schoenauer (1), Michele Sebag (1)

(1) TAO, CNRS-INRIA-LRI, Université Paris-Sud. (2) Dept. Informatics, ISEE, Kyushu University.

Abstract. Learning by imitation, among the most promising techniques for reinforcement learning in complex domains, critically depends on the human designer's ability to provide sufficiently many demonstrations of satisfactory quality. The approach presented in this paper, referred to as DIVA (Direct Value Learning for Reinforcement Learning), aims at addressing both above limitations by exploiting simple experiments. The approach stems from a straightforward remark: while it is rather easy to set a robot in a target situation, the quality of its situation will naturally deteriorate upon the action of naive controllers. The demonstrations of such naive controllers can thus be used to directly learn a value function, through a preference learning approach. Under some conditions on the transition model, this value function enables one to define an optimal controller. The DIVA approach is experimentally demonstrated by teaching a robot to follow another robot. Importantly, the approach does not require any robotic simulator to be available, nor does it require any pattern-recognition primitive (e.g. seeing the other robot) to be provided.

1 Introduction

Since the early 2000s, significant advances in reinforcement learning (RL) have been made through using direct expert input (inverse reinforcement learning [18], learning by imitation [10], learning by demonstration [16]), assuming the expert's ability to demonstrate quasi-optimal behaviors and to provide an informed representation. In 2011, new RL settings based on preference learning, allegedly less demanding for the expert, have been proposed (more in Section 2). In this paper, a new preference-based reinforcement learning approach called DIVA (Direct Value Learning for Reinforcement Learning) is proposed. DIVA aims at learning the value function directly from basic experiments. The approach is illustrated on the simple problem of teaching a robot to follow another robot. It is shown that DIVA yields a competent follower controller without requiring the primitive "I see the other robot" to be either provided or explicitly learned.

2 State of the art

Reinforcement learning is most generally formalized as a Markov decision process. It involves a state space S, an action space A, and an upper-bounded reward function r defined on the state space, r: S → ℝ. The model of the world is given by the transition function p(s, a, s′), expressing the probability of arriving in state s′ upon making action a in state s, under the Markov property; in the deterministic case, the transition function tr: S × A → S gives the state tr(s, a) of the agent upon making action a in state s. A policy π: S × A → ℝ maps each state in S onto some action in A with a given probability. The return of policy π is defined as the expectation of the cumulative reward gathered along time when selecting the current action after π, where the initial state s₀ is drawn after some probability distribution q on S.
Denoting a_h ∼ π(s_h) the random action selected by π in state s_h, s_{h+1} ∼ p(s_h, a_h, ·) the state of the agent at step h+1 conditionally to being in state s_h and selecting action a_h at step h, and r_{h+1} the reward collected in s_{h+1}, the policy return is

$$J(\pi) = \mathbb{E}_\pi\!\left[\sum_{h=0}^{\infty}\gamma^h r_h \;\middle|\; s_0 \sim q\right]$$

where γ < 1 is a discount factor enforcing the boundedness of the return, and favoring the reaping of rewards as early as possible in the agent's lifetime. The so-called value function V_π(s) estimates the expectation of the cumulative reward gathered by policy π when starting in state s, recursively given as:

$$V_\pi(s) = r(s) + \gamma \sum_{a, s'} \pi(s, a)\, p(s, a, s')\, V_\pi(s')$$

Interestingly, from a value function V a greedy policy π_V can be derived, provided the transition function is known: when in state s, select the action a leading to the state with the best expected value:

$$\pi_V(s) = \arg\max_{a \in A} \begin{cases} \sum_{s'} p(s, a, s')\, V(s') & \text{probabilistic transitions} \\ V(tr(s, a)) & \text{deterministic transitions} \end{cases} \qquad (1)$$

By construction, π_{V_π} is bound to improve on π. By learning the value function V* defined as the maximum of V_π over all policies π, V*(s) = max_π V_π(s), one can thus derive the optimal policy π* = π_{V*}. The interested reader is referred to [19, 20] for a comprehensive presentation of the main approaches to reinforcement learning, namely the value iteration and policy iteration algorithms, which build a sequence of value functions V_i and policies π_i converging to V* and π*. The bottleneck of estimating the optimal value function is that all states must be visited sufficiently often, and all actions must be triggered in every state, in order to enforce the convergence of V_i and π(V_i) toward V* and π*. For this reason, RL algorithms hardly scale up when the size of the state and action spaces is large, all the more so as the state description must encapsulate every piece of information relevant to action selection in order to enforce the Markov property.
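A minimal sketch of the greedy policy of Eq. (1) in the deterministic case follows (ours; `transition` and `value` are assumed callables, not part of the paper):

def greedy_action(state, actions, transition, value):
    """pi_V(s): pick the action whose successor state has the best value."""
    return max(actions, key=lambda a: value(transition(state, a)))

def discounted_return(rewards, gamma):
    """Discounted cumulative reward along one trajectory, as in J(pi)."""
    return sum(gamma ** h * r for h, r in enumerate(rewards))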

New RL approaches devised to alleviate this bottleneck and based on preference learning were proposed in 2011 [11, 2]. [11] is concerned with the design of the reward function in order to facilitate RL; for instance, in the medical protocol application domain, how should a numerical negative reward be associated to the patient's death? The authors thus extend the classification-oriented approach first proposed by [17] as follows. In a given state s, an action a is assessed by executing the current policy until reaching a terminal state (rollout). On the basis of these assessments, pairs of actions can be ranked with regard to state s and policy π (a ≺_{s,π} a′). These ranking constraints are exploited within a learning-to-rank algorithm (e.g. RankSVM [13]), yielding a ranking hypothesis h_{s,π}: A → ℝ. The claim is that such action ranking hypotheses are more flexible than classification hypotheses, which aim at discriminating the best actions from the others conditioned by the current state. In summary, the ranking hypothesis depends on the policy and the current state, and operates on the action space.

Quite the contrary, in [2] the ranking hypothesis operates on the policy space. The motivating application is swarm robotics, which faces two severe issues. Firstly, swarm robotics is hardly compatible with generative model-based RL approaches; simulator-based approaches suffer from the supra-linear computational complexity of simulations w.r.t. the number of robots in the swarm (besides the simulation noise). Secondly, swarm robotics hinders the inverse reinforcement learning approach [1, 15], which uses the expert's demonstrations to learn a reward function: in most cases the swarm expert cannot describe (let alone demonstrate) the individual robot behavior leading to the desired swarm behavior (known as the inverse macro-micro problem [7]). The proposed approach, called PPL (Preference-based Policy Learning), proceeds along an interactive optimization setting: the robot(s) demonstrate a behavior, which is ranked by the expert comparatively to the previous best demonstration. The ranking constraints are exploited through a learning-to-rank algorithm, yielding a ranking hypothesis on the policy demonstrations and thus on the policy space Π (h: Π → ℝ). This ranking hypothesis is used as a policy return estimate, casting RL as an optimization problem (find π*_h = argmax_π h(π)). Policy π*_h is demonstrated to the expert, who ranks it against the previous best demonstration, and the process is iterated. Note that PPL thus faces the same difficulty as interactive optimization at large [9, 22]: if the expert is presented with too constrained a sample of demonstrations, she does not have a chance to teach her preferences to the system. PPL is thus extended to integrate an active learning criterion, yielding the APRIL (Active Preference learning-based Reinforcement Learning) algorithm [3].

3 Overview of DIVA

This section introduces and formalizes the principle of DIVA, and discusses its strengths and limitations w.r.t. the state of the art.

3.1 Principle

DIVA is rooted in Murphy's law ("Anything that can possibly go wrong, does"). Formally, it posits that when the agent happens to be in some good situation, its situation tends to deteriorate under most policies. Let us illustrate this idea on the simple problem of having a robot follow another robot in an open environment. Assume that the follower robot is initially situated behind the leader robot (Fig. 1, left, depicts the follower state, given as its camera image). Assume that both the follower and leader robots are equipped with the same simple "Go Ahead" controller (same actuator value on the left and right wheels of both robots). Almost surely, each robot trajectory will deviate from the straight line, due to, e.g.,
the imperfect calibration of the wheel actuators or different sliding frictions on the ground. Almost surely, the two robot trajectories will deviate in different ways. Therefore, the follower will at some point lose track of the leader (Fig. 1, right, depicts the follower state after circa 52.7 (±20.6) time steps, that is, XX seconds).

Figure 1: Left: the follower robot is initially aligned behind the leader robot. Right: both leader and follower robots are operated by the same "Go ahead" controller. Due to mechanical drift, the follower sooner or later loses track of the leader.

The intuition can be summarized as: the follower state was never as good as in the initial time step; it becomes worse and worse along time.

3.2 Formalization

The above remarks enable one to define a ranking hypothesis on the state space, as follows. Let us consider K trajectories of the robot follower, noted S₁ ... S_K, where each trajectory S_i is defined as a sequence of states s_t^(i), t = 0 ... T_i. A value function is sought as a function V mapping the set of states S onto ℝ, satisfying the constraints V(s_t^(i)) > V(s_{t+1}^(i)) for all i = 1 ... K and t = 0 ... T_i − 1. Formally, it is assumed in the following that the state space S is embedded in ℝ^d. A linear value function V̂: S → ℝ is defined as V̂(s) = ⟨ŵ, s⟩, where ŵ ∈ ℝ^d is given after the standard learning-to-rank regularized formulation [4]:

$$\hat{w} = \arg\min_w \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{K} \sum_{0 \le t < t' \le T_i} \xi_{t,t'}^{(i)}$$
$$\text{s.t.} \quad \forall\, 1 \le i \le K,\ 0 \le t < t' \le T_i: \quad \langle w, s_t^{(i)}\rangle \ge \langle w, s_{t'}^{(i)}\rangle + 1 - \xi_{t,t'}^{(i)}; \quad \xi_{t,t'}^{(i)} \ge 0 \qquad (2)$$

This quadratic optimization under constraints problem can be solved with affordable empirical complexity [14]. After Eq. (1), and provided that the transition model is known, the value function V̂ yields a policy π̂. Further, by construction, any value function derived by a monotonous transformation of V̂ induces the same policy π̂.

3.3 Discussion

Among the main inspirations of the DIVA approach is TD-Gammon [21]. TD-Gammon, the first backgammon program to reach a champion level in the 80s, exploits games generated from self-play to train a value function along a temporal difference algorithm. The value function is likewise trained from a set of games, or trajectories S_i described as sequences of positions s_t^(i), t = 0 ... T_i. The difference is as follows. Firstly, TD-Gammon only imposes the value of the initial and final positions, with V(s_0^(i)) = 1/2 (the initial position is neutral), and V(s_{T_i}^(i)) = 1 (respectively 0) if the first player wins (resp. loses) the game.

Secondly, the learning problem is regularized through a total variation minimization (minimizing Σ_i Σ_{t=1}^{T_i−1} (V(s_t^(i)) − V(s_{t+1}^(i)))²).

A key difference between DIVA and TD-Gammon regards the availability of a simulator. If a generative model (a simulator) is available for the problem domain, then indeed RL can rely on cheap and abundant training data. In robotics, however, simulator-based training is prone to the so-called reality gap [8]. Generally speaking, the robotic framework is hardly compatible with data-intensive learning approaches, due to the poor accuracy of simulator-based data on the one hand, and the cost (experimenter time and robot fatigue) of in-situ experiments on the other hand. A consequence is that TD-Gammon can exploit high-quality simulated data whereas DIVA relies on a limited amount of data; further, these data are experimentally acquired and should require little or no expertise from the human experimenter. Along the same line, TD-Gammon only prescribes the values attached to the initial and final states of each trajectory; the bulk of learning relies on the regularization term. Quite the contrary, DIVA exploits short trajectories and sets comparison constraints on the values attached to each time step (besides using regularization too). Finally, TD-Gammon requires all winning/losing terminal states to be associated with the same value; there is no such constraint in DIVA, as states from different trajectories are not comparable.

Both approaches suffer from the same limitation: policy π_V can be computed from value function V iff the transition function is known or well approximated (Eq. (1)). Further, finding the optimal action requires one to solve an optimization problem at each time step, which usually requires the action space to be small. It will be seen (Section 4.2) that a cheap estimate of the transition function can alleviate the above limitations in the robotic framework in some cases. An additional limitation of DIVA is that the value function V is learned from a set of trajectories obtained from a naive controller, which does not necessarily reflect the target (test) distribution, thus raising a transfer learning problem [6]. This issue will be discussed in Section 5.

4 Experimental validation

A proof of principle of DIVA is given in a robotic framework (Section 4.1), based on a delayed transition model estimate (Section 4.2) and a continuity assumption (Section 4.3). The experimental setting and the goals of the experiments are described in Section 4.4, and the results are reported and discussed in Section 4.5.

4.1 Framework

The robotic platform is a Pandaboard, driven by the dual-core ARM Cortex-A9 OMAP4430, with each core running at 1 GHz, and equipped with 1 GB of DDR2 RAM. The robot is equipped with a USB camera with monochrome 8-bit color depth. All experiments are done in-situ, taking place in a real laboratory full of tables, chairs, feet and various (moving) obstacles. Although no model of the world (transition function or simulator) is available, a cheap estimate thereof can be defined (see below). The primary description of the robot state is given by its camera image, a high-dimensional real-valued vector. A pre-processing step, aimed at dimensionality reduction and inspired by SIFT (scale-invariant feature transform) descriptors, is used and operated on the Pandaboard under Linux. More precisely, SURF (speeded-up robust features) descriptors are used, which are claimed to be several times faster and more robust against different image transformations than SIFT [5].
Finally, each block of d × d contiguous pixels in the initial image is represented by an integer, the number of SURFs occurring in the block (d = 1, 2, 4, 16 in the experiments, with d = 4 the best empirical setting and the only one considered in the remainder of the paper). The state space S is thus embedded in a real-valued space of reduced dimension.

4.2 Delayed transition model estimate

In the following, the action space A includes three actions: Ahead, Right and Left. The extension to richer and more gradual action spaces is left for further work. As already mentioned, the definition of a controller from a value function requires a transition model to be available (Eq. (1)). Two working assumptions are made to overcome the lack of an accurate transition model for the considered open world.

Firstly, a delayed transition model estimate is defined as follows. Let s_{t−1} and s_t denote the states at times t−1 and t respectively, and assume that the action selected at time t−1 was Ahead. The idea is that, had the robot turned right at time t−1, it would have seen approximately the same image as in s_t, but translated to the left; additionally, some new information would have been recorded on the rightmost camera pixels while the information on the leftmost pixels would have been lost. In other words, given s_t = tr(s_{t−1}, Ahead), one can compute an approximation of tr(s_{t−1}, Right) by a circular shift of s_t. Let state s_t be described as a pixel matrix s_t[i, j], where i = 1 ... n_w, j = 1 ... n_h, and n_w (respectively n_h) is the width (resp. height) of the camera image. Then

tr(s_{t−1}, Right) ≈ {s_t[i + Δ (mod n_w), j]}

where Δ is a constant translation width (64 pixels in the experiments; this will be relaxed in further work, Section 5). Note that the unknown rightmost pixels in tr(s_{t−1}, Right) are arbitrarily replaced by the leftmost pixels of s_t; the impact of this approximation is discussed further below. More generally, given s_t = tr(s_{t−1}, a), a delayed transition model estimate is given by

$$\widehat{tr}(s_{t-1}, a' \mid a, s_t = tr(s_{t-1}, a)) = \{s_t[i + l(a, a') \ (\mathrm{mod}\ n_w), j]\} \qquad (3)$$

where the translation width l(a, a′) is Δ (resp. −Δ) if a = Ahead, a′ = Right (resp. Left), and is completed by consistency:

a        a′       l(a, a′)
Ahead    Right    Δ
Ahead    Left     −Δ
Left     Ahead    Δ
Right    Ahead    −Δ
Left     Right    2Δ
Right    Left     −2Δ

This estimate is delayed, as it is only available at time t, since tr̂(s_{t−1}, a′ | a, s_t) is computed from s_t. To some extent, this estimate can cope with partially observable environments (another robot of the swarm might arrive in sight of the current robot at time t while it was not seen at time t−1). On the other hand, the estimation error is known to be concentrated in the peripheral pixels.
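The delayed transition estimate of Eq. (3) amounts to a circular horizontal shift of the observed image, e.g. (a sketch of ours; the sign convention of the shift is arbitrary):

import numpy as np

def delayed_transition(s_t, l_shift):
    """Approximate the image that action a' would have produced at t-1,
    given the image s_t actually observed (an n_h x n_w array) and the
    signed translation width l_shift = l(a, a'). The wrapped-around
    border pixels are a known source of estimation error."""
    return np.roll(s_t, shift=l_shift, axis=1)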

Figure 2: A lesion study: recording irrelevant states at the beginning of the follower trajectory (the white board on the left and the author's feet on the right).

The continuity assumption posits that the environment changes gracefully, implying that the action which was the most appropriate at time $t-1$ is still appropriate at time $t$. Along this line, the controller defined from $V$, noted $\pi$, selects at time $t$ the action $a^*_{t-1}$ as defined by Eq. (4). A further assumption behind the definition of $\pi$ is that the noise of the delayed transition model estimate is moderate with respect to $V$. In other words, it is assumed that the peripheral pixels (unknown in truth and arbitrarily filled from $s_t$) are not key to value function $V$ (the corresponding weights have low absolute value).

4.4 Experimental setting

The controller goal is to enable the follower robot to follow the leader robot. Eleven trajectories are recorded. Each trajectory is initialized with the follower positioned behind the leader at various locations in the lab, and both robots are operated with the constant policy $\pi_0(s) = \text{Ahead}$ for 128 time steps. The $i$-th trajectory thus records the follower states $s^{(i)}_t$, $t = 1 \dots 128$. A computational time step amounts to circa 0.5 seconds of real time (two frames per second), due to the on-board computation of the SURF descriptors. The value function $V$ is learned from these trajectories as in Section 3.2. The primary goal of the experiments is to study the performance of controller $\pi$, in terms of the average time the follower can actually follow the leader. Further, in order to allow for larger time horizons, the controller run on the leader robot is an obstacle-avoidance (Braitenberg) controller, enabling the leader to run for a couple of hundred time steps before being stopped. Note that modifying the leader policy amounts to considering a different environment, making the goal more challenging, as the test setting differs from the training one: the leader can turn abruptly upon seeing an obstacle, whereas it turns only very gradually in the training trajectories. The second goal of the experiments is to assess the robustness of the approach with respect to noise. A lesion study is conducted by perturbing the initial states in the training trajectories, e.g., recording the images seen by the follower (e.g., the walls or the experimenter) before the follower actually starts to follow the leader (Fig. 2). The value function learned from the perturbed trajectories and the associated controller are referred to as NOISY-DIVA. The DIVA approach and the merits of a rank-based approach are also assessed by comparison with a simple regression-based approach, referred to as REG-DIVA, where value function $V$ is regressed from the training set $E = \{(s^{(i)}_t, t), i = 1 \dots 11, t = 1 \dots 128\}$. Finally, the assumption that peripheral pixels are not relevant to value function $V$ (Section 4.3) is also examined, by depicting the average weight of the value function for each (block of) pixel(s) in the image.

Figure 3: Comparative results of DIVA, NOISY-DIVA and REG-DIVA: histogram of the number of consecutive time steps the follower keeps on following a Braitenberg-operated leader, over 5 runs. One time step corresponds to ca. half a second, due to the on-board computation of the SURF descriptors (2 frames per second).

4.5 Results

The performance of the DIVA controller is assessed by learning $V$ from all 11 training trajectories in DIVA, NOISY-DIVA and REG-DIVA modes and running the associated $\pi$ controllers.
Every controller $\pi$ is launched on the follower robot for five runs. In each run, the follower is initially positioned behind the Braitenberg-operated leader, at various locations in the lab (same initial locations for the DIVA, NOISY-DIVA and REG-DIVA settings). In each run, the follower is manually repositioned behind the leader when it loses the track. The histogram over the 5 runs of the number of consecutive time steps during which the follower does follow the leader is displayed in Fig. 3, reporting the respective performances of DIVA, NOISY-DIVA and REG-DIVA. As was expected, the best results are obtained for the noiseless rank-based DIVA approach, followed by the noisy rank-based NOISY-DIVA approach. The fact that some tracking sequences are very short is explained as the leader very soon meets an obstacle and turns fast, making the follower lose the track. Overall, the follower stays on the leader's track for over 60 time steps in about 40% of the experiments, while the track was lost after 52.7 (±20.6) time steps in the training experiments. The performance of the DIVA controller is also assessed on the training trajectories along a leave-one-out procedure, learning 11 value functions $V^{(i)}$ from all trajectories but the $i$-th one. $V^{(i)}$ is computed on the remaining $i$-th trajectory (Fig. 4). As expected, the value of $V^{(i)}$ decreases along time. Interestingly, DIVA and NOISY-DIVA exhibit similar behaviors, rapidly decreasing as $t$ increases, although the value remains high for some runs; after visual inspection, the drift occurs late for these runs. The behavior of REG-DIVA is more erratic, which is explained as the regression setting is over-constrained regarding the quality of the experiments, requiring all states $s^{(i)}_t$ to be associated with the same value $t$.

Figure 4: Comparative behavior of the value function $V^{(i)}$ along time. A leave-one-out procedure is used: the value function is learned from all but one trajectory, and its value on the remaining trajectory is reported. The value functions learned by DIVA, NOISY-DIVA and REG-DIVA are reported respectively on the left, middle and right.

Figure 5: Comparative behavior of controller $\pi_{V^{(i)}}$ on the test trajectory along time, for DIVA (left), NOISY-DIVA (middle) and REG-DIVA (right).

The corresponding policy $\pi^{(i)}$ is plotted in Fig. 5, where $\pi^{(i)}(s^{(i)}_t)$ is indicated by a cross respectively on, above or below the central line depending on whether the selected action is Ahead, Right or Left. The ground truth, obtained by visual inspection of the logs, is reported below. It is seen that the policy generally selects the relevant action, and becomes chaotic at the end of the trajectory as the follower no longer sees the leader. Finally, the working assumption that peripheral pixels hardly matter for the value function (Section 4.3), implying that the noise of the delayed transition model estimate does not harm the controller, is confirmed by inspecting the average weight of the pixels in Fig. 6. As could have been expected, the most important region is the central upper one, where the leader is seen at the beginning of each training trajectory; the lower region is overall considered uninformative. In the upper regions, the weight decreases from the center to the left and right boundaries.

5 Discussion and perspectives

This paper has presented a proof of principle of the DIVA approach, showing that elementary experiments can be used to prime the pump of reinforcement learning and train a mildly competent value function. Indeed, a controller with comparable performance might have been manually written (hacked) easily. Many a student's experience suggests, however, that the manual approach is subject to the law of diminishing returns, as more and more effort is required to improve the controller as its performance increases. The next step thus is to study the scalability of DIVA, e.g., by gradually adding new logs to the training set. On-going experiments consider the effects of adding the test trajectories (with Braitenberg-

operated leader) to the training trajectories to retrain the value function.

Figure 6: Average relevance of the visual regions to the value function for DIVA (left), NOISY-DIVA (middle) and REG-DIVA (right).

The main current limitation of the approach concerns the small size of the action space. Still, it must be noted that the delayed transition model estimate naturally extends to fine-grained actions, by mapping the group of movement actions onto the group of image translation vectors (Section 4.2), thus enabling one to determine the optimal action in a large action space. For computational efficiency, the underlying requirement is that the target behavior be sufficiently smooth. Currently, this requirement is hardly compatible with the low number of frames per second, due to the on-board computation of the SURF descriptors. An on-going study is considering more affordable image pre-processings. Another applicative perspective is to learn a docking controller, enabling robots in a swarm to dock to each other. This goal is amenable to a DIVA approach, as training trajectories can be easily obtained by initially setting the robots in a docked position, and having every robot undock and start wandering along time. Like for the follower problem, the best state is the initial one and the value of the robot state decreases along time, in the spirit of Murphy's law.

Acknowledgments

The authors gratefully acknowledge the support of the FP7 European Project Symbrion, FET IP, and the ANR Franco-Japanese project Sydinmalas ANR-08-BLAN. A part of this research was supported by JST and the grants-in-aid for scientific research on fundamental research (B) and on challenging exploratory research.

REFERENCES

[1] P. Abbeel and A.Y. Ng, Apprenticeship learning via inverse reinforcement learning, in ICML, ed. Carla E. Brodley, volume 69 of ACM International Conference Proceeding Series. ACM, (2004).
[2] Riad Akrour, Marc Schoenauer, and Michèle Sebag, Preference-based policy learning, in Gunopulos et al. [12].
[3] Riad Akrour, Marc Schoenauer, and Michèle Sebag, APRIL: Active preference-based reinforcement learning, in ECML/PKDD, to appear, (2012).
[4] G. Bakir, T. Hofmann, B. Schölkopf, A.J. Smola, B. Taskar, and S.V.N. Vishwanathan, Machine Learning with Structured Outputs, MIT Press.
[5] Herbert Bay, Tinne Tuytelaars, and Luc J. Van Gool, SURF: Speeded up robust features, in Computer Vision - ECCV 2006, volume 3951 of Lecture Notes in Computer Science, (2006).
[6] Steffen Bickel, Christoph Sawade, and Tobias Scheffer, Transfer learning by distribution matching for targeted advertising, in Advances in Neural Information Processing Systems, NIPS 21. MIT Press, (2008).
[7] Eric Bonabeau, Marco Dorigo, and Guy Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Santa Fe Institute Studies in the Sciences of Complexity, Oxford University Press.
[8] J. Bongard, V. Zykov, and H. Lipson, Resilient machines through continuous self-modeling, Science, 314(5802), (2006).
[9] E. Brochu, N. de Freitas, and A. Ghosh, Active preference learning with discrete choice data, in Advances in Neural Information Processing Systems 20, (2008).
[10] S. Calinon, F. Guenter, and A. Billard, On learning, representing and generalizing a task in a humanoid robot, IEEE Transactions on Systems, Man and Cybernetics, Part B, Special issue on robot learning by observation, demonstration and imitation, 37(2), (2007).
[11] Weiwei Cheng, Johannes Fürnkranz, Eyke Hüllermeier, and Sang-Hyeun Park, Preference-based policy iteration: Leveraging preference learning for reinforcement learning, in Gunopulos et al. [12].
[12] Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, eds., Proc. European Conf. on Machine Learning and Knowledge Discovery in Databases, ECML PKDD, Part I, LNCS 6911, Springer Verlag.
[13] Thorsten Joachims, A support vector method for multivariate performance measures, in ICML, eds. Luc De Raedt and Stefan Wrobel, (2005).
[14] Thorsten Joachims, Training linear SVMs in linear time, in KDD, eds. Tina Eliassi-Rad, Lyle H. Ungar, Mark Craven, and Dimitrios Gunopulos. ACM, (2006).
[15] J. Zico Kolter, Pieter Abbeel, and Andrew Y. Ng, Hierarchical apprenticeship learning with application to quadruped locomotion, in NIPS. MIT Press, (2007).
[16] G. Konidaris, S. Kuindersma, A. Barto, and R. Grupen, Constructing skill trees for reinforcement learning agents from demonstration trajectories, in Advances in Neural Information Processing Systems 23, (2010).
[17] Michail Lagoudakis and Ronald Parr, Least-squares policy iteration, Journal of Machine Learning Research (JMLR), 4, (2003).
[18] A.Y. Ng and S. Russell, Algorithms for inverse reinforcement learning, in Proc. of the Seventeenth International Conference on Machine Learning (ICML-00), ed. P. Langley. Morgan Kaufmann, (2000).
[19] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge.
[20] Csaba Szepesvári, Algorithms for Reinforcement Learning, Morgan & Claypool.
[21] Gerald Tesauro, Programming backgammon using self-teaching neural nets, Artif. Intell., 134(1-2), (2002).
[22] Paolo Viappiani and Craig Boutilier, Optimal Bayesian recommendation sets and myopically optimal choice query sets, in NIPS, (2010).

Large Scale Co-Regularized Ranking

Evgeni Tsivtsivadze¹ and Katja Hofmann² and Tom Heskes³

¹ Institute for Computing and Information Sciences, Radboud University, The Netherlands & MSB Group, The Netherlands Organization for Applied Scientific Research, Zeist, The Netherlands. evgeni@science.ru.nl
² Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands. k.hofmann@uva.nl
³ Institute for Computing and Information Sciences, Radboud University, The Netherlands. t.heskes@science.ru.nl

Abstract. As unlabeled data is usually easy to collect, semi-supervised learning algorithms that can be trained on large amounts of unlabeled and labeled data are becoming increasingly popular for ranking and preference learning problems [6, 23, 8, 21]. However, the computational complexity of the vast majority of these (pairwise) ranking and preference learning methods is super-linear, as optimizing an objective function over all possible pairs of data points is computationally expensive. This paper builds upon [16] and proposes a novel large scale co-regularized algorithm that can take unlabeled data into account. This algorithm is suitable for learning to rank when large amounts of labeled and unlabeled data are available for training. Most importantly, the complexity of our algorithm does not depend on the size of the dataset. We evaluate the proposed algorithm using several publicly available datasets from the information retrieval (IR) domain, and show that it improves performance over supervised methods. Finally, we discuss possible implications of our algorithm for learning with implicit feedback in an online setting.

1 Introduction and background

Our paper proposes an algorithm that is applicable to large scale learning to rank. Unlike existing approaches, the proposed algorithm can take into account unlabeled data, leading to improved ranking performance. Learning to rank algorithms have been successfully applied to various domains such as IR [12], bioinformatics [13], and automated reasoning [11]. One of the bottlenecks associated with ranking tasks is the quadratic dependence on the size of the dataset. That is, most of the (pairwise) methods suffer from the computational burden of optimizing an objective defined over $O(m^2)$ possible pairs of data points, where $m$ is the size of the dataset. Usually, the complexity of ranking algorithms has super-linear dependency on $m$, except for the work of [16], where the use of stochastic gradient descent on pairs results in an extremely efficient training procedure with strong generalization performance. Pairwise learning to rank with stochastic gradient descent results in a scalable methodology that we refer to as stochastic pairwise descent (SPD). In [16] it is demonstrated that such stochastic gradient based learning methods provide state-of-the-art results using a fraction of a second of CPU time for training. However, SPD is only applicable to supervised learning problems. A natural and useful extension of SPD is an extension to semi-supervised learning, where, in addition to labeled data, a large amount of unlabeled data is used for training. We address this problem and present a large scale co-regularized ranking algorithm for semi-supervised tasks. Our large scale co-regularized ranking algorithm (LCRA) is formulated within a multi-view framework. In this framework, the dataset attributes (i.e., features) are split into independent sets, and an algorithm is trained based on these different views.
The goal of the learning process is to find a prediction function for every view that performs well on the labeled data, in such a way that all prediction functions agree on the unlabeled data. Closely related to this approach is the co-regularization framework described in [18], where the same idea of agreement maximization between the predictors is central. Recently it has been demonstrated that the co-regularization approach works well for various tasks, e.g., domain adaptation [7], classification, regression [2], and clustering [3]. Moreover, theoretical investigations demonstrate that co-regularization reduces the Rademacher complexity by an amount that depends on the distance between the views [15, 19]. We think that our co-regularization algorithm is particularly promising in online learning to rank for IR settings, where a search engine learns from the limited feedback that can be inferred from direct interactions with its users. In this paper, we first focus on co-regularization (in Section 2) and our co-regularized pairwise learning algorithm (Sections 3 and 4). Finally, we discuss possible extensions to the online learning to rank for IR setting (Section 5).

2 Large Scale Pairwise Learning to Rank and Co-regularization

Consider a training set $D = (X, Y, Q)$, where $X = (x_1, \dots, x_m)^T \in \mathbb{R}^{m \times n}$ contains the $n$-dimensional feature vectors of the data points, $Y = (y_1, \dots, y_m)^T \in \mathbb{N}^m$ are the scores/ranks, and $Q = (q_1, \dots, q_m)^T \in \mathbb{N}^m$ are the indices identifying to which group/query a particular data point belongs. Note that each entry $x_i \in X$ consists of features that encode the relation of a particular item (e.g., a document) to group or query $q_i$. In addition to the training set $D = (X, Y, Q)$ with labeled data, we have a training set $\hat{D} = (\hat{X}, \hat{Q})$ with unlabeled data points. Also, let us consider $M$ different hypothesis spaces $H_1, \dots, H_M$, or so-called views. These views stem from different representations

of the data points, unique subsets of features. Finally, we define the sets of candidate pairs $P$ and $\hat{P}$ (implied by the datasets $D$ and $\hat{D}$) as the sets of all tuples $((a, y_a, q_a), (b, y_b, q_b))$ and $((a, q_a), (b, q_b))$, where $y_a \neq y_b$ and $q_a = q_b$. Co-regularized algorithms are usually not straightforwardly applicable to large scale learning tasks, where large amounts of unlabeled as well as labeled data are available for training. Several recently proposed algorithms have complexity that is linear in the number of unlabeled data points and super-linear in the number of labeled examples (e.g., cubic in the case of co-regularized least squares [2, 23]). Such methods become infeasible to use as the dataset size increases. In particular, this applies to the pairwise learning setting (note that in the worst case $P$ grows quadratically with the dataset size).

2.1 Constructing the pairs

In the pairwise learning setting it is important to be able to sample from $P$ and $\hat{P}$ without explicitly constructing the datasets of pairs. Several approaches to address this problem are suggested in [16]. For simplicity we adopt one of the most basic techniques: we repeatedly select two examples $(a, y_a, q_a)$ and $(b, y_b, q_b)$ from the data until a pair is found such that $y_a \neq y_b$ and $q_a = q_b$. Then we construct the feature vector of the corresponding pair as $p = a - b$ and its label as $y = y_a - y_b$ (a sketch of this sampling step is given below).

3 The Algorithm

Stochastic gradient based algorithms are amongst the most popular approaches for large scale learning. Methods such as Pegasos [17], LaSVM [1], GURLS [22] and many others have been successfully applied to large scale classification and regression problems, leading to state-of-the-art generalization performance. Recently, the SPD algorithm [16] has been successfully used to address large scale learning to rank tasks. The main idea is to sample candidate pairs from $P$ for the stochastic steps, without constructing $P$ explicitly. This avoids dependence on $|P|$. In essence, the approach proposed in [16] reduces learning to rank to learning a binary classifier via stochastic gradient descent. This reduction preserves the convergence properties of stochastic gradient descent. Our algorithm is related to the above mentioned methods but is preferable in case unlabeled data points are available for learning. Let us consider the large scale co-regularized ranking algorithm. We write the objective function as

$$J(W) = \sum_{v=1}^{M} \left[ \sum_{i=1}^{|P|} L(p_i^v, y_i; w^v) + \lambda R(w^v) \right] + \mu \sum_{\substack{v,u=1 \\ v \neq u}}^{M} \sum_{i=1}^{|\hat{P}|} L_C(p_i^v, p_i^u; w^v, w^u), \qquad (1)$$

where the first term corresponds to the loss function on the labeled pairs and the second term to a regularization of the individual prediction functions. The third is the co-regularization term that measures the disagreement between the different prediction functions on the unlabeled pairs. Note that $W \in \mathbb{R}^{M \times n}$ is a matrix containing the weight vectors of the different views. Once the model is trained, the final prediction can be obtained, for example, by averaging the individual predictions of the different views (as in [15]). We can approximate the optimal solution (obtained when minimizing (1)) by means of gradient descent:

$$w^v_{t+1} = w^v_t - \eta^v_t \nabla_{w^v} J(W). \qquad (2)$$

Let us consider the setting in which the squared loss function is used for co-regularization, and the $L_2$ norm is used for regularization.
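The following is a minimal sketch of the sampling step of Section 2.1 (the function names, the parallel-array data layout, and the use of NumPy's random Generator are illustrative assumptions, not from the paper): labeled pairs are rejection-sampled until the query ids match and the labels differ; for unlabeled data only the query constraint can be checked.

import numpy as np

def sample_labeled_pair(X, Y, Q, rng):
    # Draw indices until the two examples share a query and differ in label,
    # then return the pair feature vector p = a - b and its label y = y_a - y_b.
    m = len(Y)
    while True:
        i, j = rng.integers(m), rng.integers(m)
        if Q[i] == Q[j] and Y[i] != Y[j]:
            return X[i] - X[j], Y[i] - Y[j]

def sample_unlabeled_pair(X_hat, Q_hat, rng):
    # For pairs from P_hat only the group constraint q_a = q_b can be enforced.
    m = len(Q_hat)
    while True:
        i, j = rng.integers(m), rng.integers(m)
        if i != j and Q_hat[i] == Q_hat[j]:
            return X_hat[i] - X_hat[j]

# Example usage:
#   rng = np.random.default_rng(0)
#   p, y = sample_labeled_pair(X, Y, Q, rng)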
Choosing the squared loss for the co-regularization term is quite natural, as it penalizes the differences among the prediction functions constructed for the multiple views (similar to the standard regression setting, where the differences between the predicted and true scores are penalized). For every iteration $t$ of the algorithm, we first construct pairs via the procedure described in Section 2.1 and denote the set of selected pairs by $A_t \subset P$, of size $k$. Similarly, we choose $\hat{A}_t \subset \hat{P}$ of size $l$ for each round $t$ on the unlabeled dataset. Let us also denote by $A^v_t$ the set $A_t$ as seen in the view $v$. Then, we replace the true objective (1) with an approximate objective function and write the update rule as follows:

$$w^v_{t+1} = (1 - \eta^v_t \lambda)\, w^v_t \;-\; \eta^v_t \sum_{(p,y) \in A^v_t} \nabla L(p^v, y; w^v_t) \;-\; 4\mu \eta^v_t \sum_{\substack{v,u=1 \\ v \neq u}}^{M} \sum_{p \in \hat{A}^v_t \cup \hat{A}^u_t} \big( w_t^{v\top} p^v - w_t^{u\top} p^u \big)\, p^v. \qquad (3)$$

Note that if we choose $A_t$ to contain a single randomly selected pair, we recover a variant of the stochastic gradient method. In general, we allow $A_t$ to be a set of $k$ and $\hat{A}_t$ a set of $l$ data points sampled i.i.d. from $P$ and $\hat{P}$, respectively. Recall that in the setting described above we are solving a learning to rank problem via reduction to classification of pairs of data points. For classification tasks, the hinge loss is usually considered more appropriate, although several studies have empirically demonstrated that the squared loss often leads to similar performance (see [14, 27]). The update rule using the hinge loss is derived as follows. Let us define $A^{v+}$ to be the set of examples for which $w^v$ obtains a non-zero loss, that is, $A^{v+} = \{(p^v, y) \in A^v_t : y \langle p^v, w^v \rangle < 1\}$. Then, by substituting the second term in equation (3) with $\eta^v_t \sum_{(p,y) \in A^{v+}} y\, p^v$, we obtain the update rule for the large scale co-regularized algorithm with hinge loss. When the squared loss function is used for labeled and unlabeled data, we obtain the update rule by substituting the second term in equation (3) with $\eta^v_t \sum_{(p,y) \in A^v_t} (y - w_t^{v\top} p^v)\, p^v$. Our large scale co-regularized ranking algorithm has complexity $O(Md)$, where $d$ is the number of nonzero elements in $p^v$. The pseudocode is shown in Algorithm 1.

4 Experiments

The task of ranking query-document pairs is a problem central to document retrieval: given a query, some of the available documents are more relevant to it than others. Because the user will usually be most interested in the top results returned, document retrieval systems are typically evaluated using performance measures such as mean average precision (MAP).
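Before turning to the benchmarks, the update derived above can be made concrete. The following is a minimal sketch of one iteration with the hinge loss (Algorithm 1 below gives the full pseudocode); the function names, the view construction by splitting the pair vector into contiguous blocks, and the reuse of the samplers sketched in Section 2.1 are illustrative assumptions, not the authors' implementation.

import numpy as np

def make_views(p, M):
    # Illustrative view construction: split the pair vector into M contiguous
    # blocks (the experiments in Section 4 use a random attribute partition).
    return np.array_split(np.asarray(p, dtype=float), M)

def lcra_step(W, t, labeled_pairs, unlabeled_pairs, lam, mu):
    # One iteration of the co-regularized hinge-loss update (cf. Algorithm 1).
    # W is a list of per-view weight vectors w^v; labeled_pairs is the batch
    # A_t of (p, y) tuples, unlabeled_pairs the batch of unlabeled pair vectors.
    M = len(W)
    eta = 1.0 / (lam * t)                        # step size eta_t = 1/(lambda t)
    W_new = [(1.0 - eta * lam) * w for w in W]   # L2 shrinkage term
    for v in range(M):
        for p, y in labeled_pairs:               # hinge term over A_t^{v+}
            pv = make_views(p, M)[v]
            if y * pv.dot(W[v]) < 1.0:           # margin violated in view v
                W_new[v] = W_new[v] + eta * y * pv
        for u in range(M):                       # co-regularization term
            if u == v:
                continue
            for p in unlabeled_pairs:
                views = make_views(p, M)
                disagree = W[v].dot(views[v]) - W[u].dot(views[u])
                W_new[v] = W_new[v] - 4.0 * mu * eta * disagree * views[v]
    return W_new

A full run would initialize each $w^v$ to zeros of the corresponding view dimension, draw fresh batches of sizes $k$ and $l$ at every iteration $t = 1, \dots, N$, and predict by averaging the per-view scores, as noted after Eq. (1).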

Algorithm 1 Large scale co-regularized ranking algorithm (LCRA-k-l)

Require: Datasets $D$ and $\hat{D}$, regularization parameter $\lambda$, batch sizes $k$ and $l$, number of iterations $N$, number of views $M$, co-regularization parameter $\mu$.
Ensure: $W = 0$
1: for $t = 1, 2, \dots, N$ do
2:   Construct $A_t \subset P$ (using the procedure from Sec. 2.1), where $|A_t| = k$, and $\hat{A}_t \subset \hat{P}$, where $|\hat{A}_t| = l$
3:   for $v = 1, 2, \dots, M$ do
4:     Set $A^{v+}_t = \{(p^v, y) \in A^v_t : y \langle p^v, w^v_t \rangle < 1\}$
5:     Set $\eta^v_t = \frac{1}{\lambda t}$
6:     $w^v_{t+1} \leftarrow (1 - \eta^v_t \lambda)\, w^v_t + \eta^v_t \sum_{(p,y) \in A^{v+}_t} y\, p^v - 4\mu\eta^v_t \sum_{\substack{v,u=1 \\ v \neq u}}^{M} \sum_{p \in \hat{A}^v_t \cup \hat{A}^u_t} (w_t^{v\top} p^v - w_t^{u\top} p^u)\, p^v$
7: Output $W$

To benchmark the performance of our algorithm we use Letor 3.0 (LEarning TO Rank), a collection of several datasets extracted from corresponding IR data collections. The whole collection consists of a set of document-query pairs. Each document-query pair is represented as an example with a quite small number of highly abstract features. Our experiments are performed on each of the datasets separately. We preprocess the datasets by normalizing all feature values to values between 0 and 1 on a per-query basis. We use the classical learning setting, where 70% of the data is used for training and the remaining 30% for testing. To simulate a semi-supervised learning setting, a subset of 20% of the training data is randomly selected to be used as labeled data. From the remaining training data, labels are removed. We compare the performance of our large scale co-regularized ranking algorithm with several other methods, namely the baseline, supervised, version of the algorithm (without co-regularization), which is equivalent to SPD [16] and to the pairwise Pegasos algorithm [17]. We also compare with the multi-view version of the algorithm, also excluding the co-regularization term, referred to as SPD MV. The comparison is made with several instantiations of the large scale co-regularized ranking algorithm, termed LCRA-k-l, using various sizes of unlabeled example batches. For the supervised learning algorithms, only the labeled part of the dataset is used for training. The same set is then used for training the co-regularized model, together with the unlabeled data. Parameter selection for each model is done by 5-fold cross-validation over the training partition of the data. For the supervised models, the parameters to be selected are the learning rate $\eta_0$ and the regularization parameter $\lambda$. For the supervised and semi-supervised multi-view models we consider two views that are constructed via random partitioning of the data attributes into two unique sets. Such a division of the attributes for constructing multiple views has been previously used in [2]. For the multi-view model we have to estimate the learning rate $\eta_0$, as well as the $\lambda_1$ and $\lambda_2$ parameters. The semi-supervised model has an additional parameter $\mu$ controlling the influence of the co-regularization on model selection. The results of our experiments are included in Table 1. It can be observed that in all experiments the proposed LCRA algorithm outperforms the supervised learning methods. The obtained results are expected, as it has been previously demonstrated that co-regularization leads to improved classification performance. Note that our algorithm can be considered a pairwise classification approach for learning to rank.

Table 1. MAP-performance comparison of the LCRA-1-5 and LCRA-1-1 instantiations of the LCRA algorithm with the SPD and SPD MV baselines on the Letor datasets TD2003, TD2004, NP2003, NP2004, HP2003, HP2004, and OHSUMED.
Note that the results of the supervised learning algorithms are not comparable to previously reported benchmarks on the Letor datasets, due to the fact that they are trained on only 20% of the labeled data points.

5 Co-regularization in Online Learning to Rank for IR

In the previous sections we introduced a co-regularization algorithm for semi-supervised learning to rank. We think that this approach is particularly promising in the context of online learning to rank for IR, and discuss its application below. In online learning for IR, a search engine learns improved ranking functions by directly interacting with its users⁴ [10, 25]. It is typically modeled as a contextual bandit problem⁵ [20], where the query is the context provided by the user, and feedback can be inferred from user clicks on the result documents that the search engine returned in response to the query.

⁴ Here we use online in the RL sense, meaning that learning and application of the learned solution are performed at the same time. Note that this differs from the term's use in the optimization literature, where it usually refers to the scalability of algorithms.
⁵ Contextual bandit problems are a type of reinforcement learning problem in which actions (i.e., presented documents) depend on the context (i.e., the query), but not on previous interactions between system and environment.

5.1 Online learning for IR

The most important difference between the online learning to rank setting and the traditionally considered supervised learning to rank for IR setting is the feedback available to the learning algorithm. As in other reinforcement learning (RL) settings [20], a retrieval system that learns from user interactions can only infer feedback about the documents or document rankings that it actually presents to the user (and that are inspected by the user). This results in much more limited information to learn from than is available in the supervised

setting, where it is assumed that labels are exhaustive [24] or sampled in some systematic way [4]. In addition to the limited amount of feedback available in the online learning to rank for IR setting, the quality of the feedback is constrained as follows. Users of a retrieval system expect a ranked list of results that is more or less ordered by the usefulness of these results given their information need (expressed as, e.g., a text query). Consequently, they are most likely to inspect results presented at the top of the returned result list, and continue examining and/or interacting with documents at subsequently lower ranks until their information need is met, or until they run out of time, patience, or some other restricted resource. For the most effective learning outcomes, this means that the documents (or pairs of documents) on which feedback would be most useful for learning should be presented first (we call this strategy of presenting result documents exploration). However, the documents that are most useful for learning may not be the ones that are most likely to fulfill the user's information need (exploitation). Consequently, an effective learning algorithm should balance exploration and exploitation to optimize online performance, i.e., performance while learning from user interactions.

5.2 Related work

Several recent approaches address the problem of learning to rank for IR in an online setting. Yue et al. formulate the Dueling Bandit problem [26] and the K-armed Dueling Bandit problem [25]. In both formulations, learning is based on observing pairwise feedback on complete rankings, obtained using so-called interleaved comparison methods, where user clicks are observed on specially constructed result lists that allow inferring a preference between two rankings [5, 9]. The algorithms proposed to address these tasks follow an explore-then-exploit approach, where it is assumed that an algorithm can learn quickly before starting to exploit what has been learned, and performance while learning is largely ignored. Follow-up work suggests that online performance can be improved by balancing exploration and exploitation [10]. In this work, it was shown that different learning approaches are affected by a balance of exploration and exploitation in different ways. For the listwise Dueling Bandit approach, it was shown that the originally proposed, purely exploratory algorithm over-explored, and that effective learning could be achieved by injecting only two exploratory documents into an otherwise exploitative result list. For the alternative pairwise approach, which is the most similar to the SPD approach evaluated in this paper, it was found that a purely exploitative algorithm performed very well when feedback could be reliably inferred from user clicks. However, when user feedback was noisy, bias introduced by the preference of users for higher-ranked results caused learning outcomes to deteriorate dramatically. To combat this performance loss, exploration had to be increased (which, in its simplest form, can consist of random exploration).

5.3 Relation to online co-regularization

The co-regularization algorithm presented in this paper is directly applicable to existing pairwise learning approaches for the online learning to rank for IR setting. Because labeled feedback is particularly limited in the start-up phase of an online learning task, high initial learning gains are expected in this setting when easily obtainable unlabeled data can be used to complement this data.
The resulting higher-quality result lists are expected to result in more reliable feedback [10]. As a result, learning could be sped up while reducing the need for exploration, leading to increased online performance. An experimental investigation of this hypothesis will be conducted in follow-up work.

6 Conclusion

In this paper we have presented a large-scale co-regularized algorithm for ranking and preference learning. Our algorithm can use unlabeled data to improve learning when labeled data is scarce. Our experiments on 7 standard learning to rank datasets show that our co-regularization component consistently improves performance over algorithms without co-regularization. We think that this approach can be particularly beneficial in an online learning to rank setting, where algorithms learn directly from interacting with users and effective use of limited feedback is paramount. We conclude with a brief outlook on future work in this area.

Acknowledgments

This research was partially supported by the Netherlands Organisation for Scientific Research (NWO) under a project grant and a Vici grant.

References

[1] Léon Bottou, Antoine Bordes, and Seyda Ertekin, LaSVM.
[2] Ulf Brefeld, Thomas Gärtner, Tobias Scheffer, and Stefan Wrobel, Efficient co-regularised least squares regression, in ICML'06. ACM, (2006).
[3] Ulf Brefeld and Tobias Scheffer, Co-EM support vector learning, in ICML'04, p. 16, New York, NY, USA, (2004). ACM.
[4] Ben Carterette, James Allan, and Ramesh Sitaraman, Minimal test collections for retrieval evaluation, in SIGIR'06. ACM, (2006).
[5] Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue, Large-scale validation and analysis of interleaved search evaluation, ACM Trans. Inf. Syst., 30(1), 6:1-6:41, (2012).
[6] W. Chu and Z. Ghahramani, Extensions of Gaussian processes for ranking: semi-supervised and active learning, in NIPS Workshop on Learning to Rank, (2005).
[7] Hal Daumé III, Abhishek Kumar, and Avishek Saha, Co-regularization based semi-supervised domain adaptation, in NIPS'10, eds. J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, (2010).
[8] Kevin Duh and Katrin Kirchhoff, Semi-supervised ranking for document retrieval, Comput. Speech Lang., 25(2), (2011).
[9] K. Hofmann, S. Whiteson, and M. de Rijke, A probabilistic method for inferring preferences from clicks, in CIKM'11, (2011).
[10] K. Hofmann, S. Whiteson, and M. de Rijke, Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval, Information Retrieval Journal (to appear), (2012).
[11] Daniel Kühlwein, Josef Urban, Evgeni Tsivtsivadze, Herman Geuvers, and Tom Heskes, Multi-output ranking for automated reasoning, in KDIR'11, (2011).

[12] Tie-Yan Liu, Learning to rank for information retrieval, Foundations and Trends in Information Retrieval, 3(3), (2009).
[13] Iain Melvin, Jason Weston, Christina S. Leslie, and William Stafford Noble, RankProp: a web server for protein remote homology detection, Bioinformatics, 25(1), (2009).
[14] Ryan Rifkin, Gene Yeo, and Tomaso Poggio, Regularized least-squares classification, in Advances in Learning Theory: Methods, Model and Applications, eds. J.A.K. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, volume 190 of NATO Science Series III: Computer and System Sciences, chapter 7, IOS Press, Amsterdam, (2003).
[15] David Rosenberg and Peter L. Bartlett, The Rademacher complexity of co-regularized kernel classes, in AISTATS'07, eds. Marina Meila and Xiaotong Shen, (2007).
[16] D. Sculley, Large scale learning to rank, in NIPS Workshop on Advances in Ranking, (2009).
[17] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro, Pegasos: Primal Estimated sub-gradient SOlver for SVM, in ICML'07. ACM, (2007).
[18] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin, A co-regularization approach to semi-supervised learning with multiple views, in ICML Workshop on Learning with Multiple Views, (2005).
[19] Vikas Sindhwani and David Rosenberg, An RKHS for multi-view learning and manifold co-regularization, in ICML'08, eds. Andrew McCallum and Sam Roweis, Helsinki, Finland, (2008).
[20] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA.
[21] Martin Szummer and Emine Yilmaz, Semi-supervised learning to rank with preference regularization, in CIKM'11. ACM, (2011).
[22] Andrea Tacchetti, Pavan Mallapragada, Matteo Santoro, and Lorenzo Rosasco, GURLS: a toolbox for large scale multiclass learning, in NIPS 2011 Workshop on Parallel and Large-Scale Machine Learning.
[23] Evgeni Tsivtsivadze, Tapio Pahikkala, Jorma Boberg, Tapio Salakoski, and Tom Heskes, Co-regularized least-squares for label ranking, in Preference Learning, eds. Eyke Hüllermeier and Johannes Fürnkranz, (2010).
[24] Ellen M. Voorhees and Donna K. Harman, TREC: Experiment and Evaluation in Information Retrieval, Digital Libraries and Electronic Publishing, MIT Press.
[25] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims, The k-armed dueling bandits problem, Journal of Computer and System Sciences, 78(5), (2012).
[26] Yisong Yue and Thorsten Joachims, Interactively optimizing information retrieval systems as a dueling bandits problem, in ICML'09, (2009).
[27] Peng Zhang and Jing Peng, SVM vs regularized least squares classification, in Proceedings of the International Conference on Pattern Recognition, ICPR'04, (2004).

First Steps Towards Learning from Game Annotations

Christian Wirth and Johannes Fürnkranz¹

¹ TU Darmstadt, Germany, [cwirth,juffi]@ke.tu-darmstadt.de

Abstract. Most of the research in the area of evaluation function learning is focused on self-play. However, in many domains, like chess, expert feedback is amply available in the form of annotated games. This feedback usually comes in the form of qualitative information, due to the inability of humans to determine precise utility values for game states. We present a first step towards integrating this qualitative feedback into evaluation function learning by reformulating it in terms of preferences. We extract preferences from a large-scale database of annotated chess games and use them for calculating the feature weights of a heuristic chess position evaluation function. This is achieved by extracting the feature weights out of the linear kernel of a learned SVMRANK model, based upon the given preference relations. We evaluate the resulting function by creating multiple heuristics based upon different-sized subsets of the training data and comparing them in a tournament scenario. Although our results did not yield a better chess playing program, they confirm that preferences derived from game annotations may be used to learn chess evaluation functions.

1 Introduction

For many problems, human experts are able to demonstrate good judgment about the quality of certain courses of actions or solution attempts. Typically, this information is of qualitative nature (e.g., "Treatment a is more effective than treatment b"), and cannot be expressed numerically without selecting arbitrary values. This is due to the fact that humans are not able to determine a precise utility value of an option, but are typically able to compare the quality of two options. The emerging field of preference learning tries to make this information usable in the field of machine learning, by introducing concepts and methods for applying qualitative preferences to a wide variety of learning problems [12]. In the game of chess, qualitative human feedback is amply available in the form of game annotations. For example, the company Chessbase specializes in the collection and distribution of chess databases. Their largest database contains annotations for over games, according to the official page. However, this rich source of information about the game has so far been ignored in the literature on machine learning in chess [21, 11]. Much of the work in this area has concentrated on the application of reinforcement learning algorithms to learn meaningful evaluation functions [3, 4, 7]. These approaches have all been modeled after the success of TD-Gammon [24], a learning system that uses temporal-difference learning [22] for training a game evaluation function [23]. However, all these algorithms were trained exclusively on self-play, entirely ignoring human feedback that is readily available in annotated game databases. In this paper, we report the results of a first study that aims at learning a heuristic function for chess based on a large amount of qualitative feedback from experts, available in annotated game databases. In particular, we show how preferences can be extracted from chess databases, and how state-of-the-art ranking algorithms can be used to successfully learn an evaluation function. The learning setup is based on the methodology used in [16], where it has been used for learning evaluation functions from move preferences of chess players of different strengths.
However, to our knowledge this is the first work that reports results on learning evaluation functions from game annotations. In Section 2, we explain which information can be contained in annotations of chess games, especially concerning portable game notation files with numeric annotation glyphs [8], a widely available data format. Section 3 details the object ranking by preferences method in general, as well as how to extract the preference information. In our experimental setup (Section 5), we train an SVM on the preference data, based upon the state feature values given by a strong chess engine. This enables the creation of a new heuristic evaluation function by using the learned (linear) SVM model. The quality of the resulting function is evaluated in a chess engine tournament. Section 7 concludes the paper and gives a short overview of possible further work.

2 Game Annotations in Chess

Chess is a game of great interest, which has generated a large amount of literature that analyzes the game. Particularly popular are game annotations, which are frequently published after important or interesting games have been played in tournaments. These annotations reflect the analysis of a particular game by a (typically) strong professional chess player. They have been produced without any time constraints, and the annotators can resort to any means they deem necessary for improving their judgement (such as consulting colleagues, books, or computers). Thus, these annotations are usually of a high quality. Annotated chess games are amply available, not only in chess books or magazines. Chess databases, such as those provided by companies like Chessbase, store millions of games, many of them annotated. Chess players of all strengths use them regularly to study the game or to prepare against their next opponent. Chess annotators use a standardized set of symbols for annotating moves and positions, which have been popularized by the Chess Informant book series. Portable game notation (PGN) files are chess games recorded in standard algebraic notation with optional numeric annotation glyphs (NAG) [8]. Those annotation symbols can be divided into three major categories: move, position and time evaluation.

Figure 1. An annotated chess game (screen-shot).

move evaluation: Each move can be annotated with a symbol indicating its quality. Six symbols are commonly used: very poor move (??), poor move (?), speculative move (?!), interesting move (!?), good move (!), very good move (!!).

position evaluation: Each move can be annotated with a symbol indicating the quality of the position it is leading to: white has a decisive advantage (+-), white has a moderate advantage (±), white has a slight advantage (⩲), equal chances for both sides (=), black has a slight advantage (⩱), black has a moderate advantage (∓), black has a decisive advantage (-+), the evaluation is unclear (∞).

time evaluation: Each move can be annotated with a symbol indicating a time constraint that arose at this move. This information is not used in our experiments.

In addition to annotating games with NAG symbols, annotators can also add textual comments and move variations to the game, i.e., in addition to the moves that have actually been played in the course of the game, an annotator provides alternative lines of play. Those are usually suggestions in the form of short move chains that lead to more promising states than the move chain used in the real game. Variations can also have NAG symbols, and may contain subvariations. Figure 1 shows an example of an annotated chess game. The left-hand side shows the game position after the 13th move of white. Here, black is in a difficult position after the mistake he made (12...Qg6?). From the suggested moves, 13...a5?! is the best, but even here white has the upper hand at the end of the variation (18.Rec1! ±), as well as at the end of the suggested move chain starting with 13...QXc2. On the other hand, 13...NXc2?? is an even worse choice, ending in a position that is clearly lost for black (+-). It is important to note that this feedback is of qualitative nature, i.e., it is not clear what the expected reward is in terms of, e.g., percentage of won games from a position with evaluation ±. However, it is clear that positions with evaluation ± are preferable to positions with evaluation ⩲ or worse (=, ⩱, ∓, -+). Also note that the feedback for positions typically applies to the entire sequence of moves that has been played up to reaching this position (a trajectory in reinforcement learning terminology). The qualitative position evaluations may be viewed as providing an evaluation of the trajectory that led to this particular position, whereas the qualitative move evaluations may be viewed as evaluations of the expected value of a trajectory that starts at this point. However, even though there is a certain correlation between these two types of annotations (good moves tend to lead to better positions and bad moves tend to lead to worse positions), they are not interchangeable. A very good move may be the only move that saves the player from imminent doom, but need not necessarily lead to a very good position. Conversely, a bad move may be a move that misses a chance to mate the opponent right away, but the resulting position may still be good for the player.

3 Learning an Evaluation Function from Preferences

For learning the mentioned SVM model, it is required to formulate the task as a binary classification problem. We show how this can be done by using preference learning.

3.1 Preference Learning

Preference learning is about inducing predictive preference models from empirical data.
This establishes a connection between machine learning and research fields like preference modeling or decision making. Especially learning to rank by preferences is deemed promising by the community. Preference learning can be applied to label ranking, by defining preferences over a set of labels concerning a specific set of objects [25]. But it is also possible to define preferences directly over a set of objects, for creating a ranking of those objects [14]. Preferences themselves are constraints that can be violated, which leads to higher flexibility concerning the solving process, as opposed to hard constraints. These constraints can be described via a utility function or via preference relations [12]. We only consider preference relations in this work, because they can be represented in a qualitative manner. Object ranking is about learning how to order a subset of objects out of a (potentially infinite) reference set $Z$. Those objects $z \in Z$ are usually given as a vector of attribute/value pairs, but this is not a necessary property. The training data is given in the form of rankings, which are decomposed into a finite set of pairwise preferences $z_i \succ z_j$. The object ranker then learns a ranking function $f(\cdot)$ which returns a (ranked) permutation of a given object set [12].

3.2 States and Actions

In chess, we are searching for the best action $a \in A$ for a state $s \in S$. For game tree exploration concerns or suboptimal play, it can also be required to determine the expected quality of a suboptimal action


Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Interleaving Schemes on Circulant Graphs with Two Offsets

Interleaving Schemes on Circulant Graphs with Two Offsets Interleaving Schemes on Circulant raphs with Two Offsets Aleksandrs Slivkins Department of Computer Science Cornell University Ithaca, NY 14853 slivkins@cs.cornell.edu Jehoshua Bruck Department of Electrical

More information

Monotone Paths in Geometric Triangulations

Monotone Paths in Geometric Triangulations Monotone Paths in Geometric Triangulations Adrian Dumitrescu Ritankar Mandal Csaba D. Tóth November 19, 2017 Abstract (I) We prove that the (maximum) number of monotone paths in a geometric triangulation

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

Parameterized graph separation problems

Parameterized graph separation problems Parameterized graph separation problems Dániel Marx Department of Computer Science and Information Theory, Budapest University of Technology and Economics Budapest, H-1521, Hungary, dmarx@cs.bme.hu Abstract.

More information

An Eternal Domination Problem in Grids

An Eternal Domination Problem in Grids Theory and Applications of Graphs Volume Issue 1 Article 2 2017 An Eternal Domination Problem in Grids William Klostermeyer University of North Florida, klostermeyer@hotmail.com Margaret-Ellen Messinger

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

Lower bounds and recursive methods for the problem of adjudicating conflicting claims

Lower bounds and recursive methods for the problem of adjudicating conflicting claims Lower bounds and recursive methods for the problem of adjudicating conflicting claims Diego Dominguez University of Rochester March 31, 006 Abstract For the problem of adjudicating conflicting claims,

More information

Unifying and extending hybrid tractable classes of CSPs

Unifying and extending hybrid tractable classes of CSPs Journal of Experimental & Theoretical Artificial Intelligence Vol. 00, No. 00, Month-Month 200x, 1 16 Unifying and extending hybrid tractable classes of CSPs Wady Naanaa Faculty of sciences, University

More information

COMP260 Spring 2014 Notes: February 4th

COMP260 Spring 2014 Notes: February 4th COMP260 Spring 2014 Notes: February 4th Andrew Winslow In these notes, all graphs are undirected. We consider matching, covering, and packing in bipartite graphs, general graphs, and hypergraphs. We also

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

1. Lecture notes on bipartite matching

1. Lecture notes on bipartite matching Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans February 5, 2017 1. Lecture notes on bipartite matching Matching problems are among the fundamental problems in

More information

Consistency and Set Intersection

Consistency and Set Intersection Consistency and Set Intersection Yuanlin Zhang and Roland H.C. Yap National University of Singapore 3 Science Drive 2, Singapore {zhangyl,ryap}@comp.nus.edu.sg Abstract We propose a new framework to study

More information

A step towards the Bermond-Thomassen conjecture about disjoint cycles in digraphs

A step towards the Bermond-Thomassen conjecture about disjoint cycles in digraphs A step towards the Bermond-Thomassen conjecture about disjoint cycles in digraphs Nicolas Lichiardopol Attila Pór Jean-Sébastien Sereni Abstract In 1981, Bermond and Thomassen conjectured that every digraph

More information

8 Matroid Intersection

8 Matroid Intersection 8 Matroid Intersection 8.1 Definition and examples 8.2 Matroid Intersection Algorithm 8.1 Definitions Given two matroids M 1 = (X, I 1 ) and M 2 = (X, I 2 ) on the same set X, their intersection is M 1

More information

Faster parameterized algorithms for Minimum Fill-In

Faster parameterized algorithms for Minimum Fill-In Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Technical Report UU-CS-2008-042 December 2008 Department of Information and Computing Sciences Utrecht

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

Chordal deletion is fixed-parameter tractable

Chordal deletion is fixed-parameter tractable Chordal deletion is fixed-parameter tractable Dániel Marx Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany. dmarx@informatik.hu-berlin.de Abstract. It

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

5. Lecture notes on matroid intersection

5. Lecture notes on matroid intersection Massachusetts Institute of Technology Handout 14 18.433: Combinatorial Optimization April 1st, 2009 Michel X. Goemans 5. Lecture notes on matroid intersection One nice feature about matroids is that a

More information

Small Survey on Perfect Graphs

Small Survey on Perfect Graphs Small Survey on Perfect Graphs Michele Alberti ENS Lyon December 8, 2010 Abstract This is a small survey on the exciting world of Perfect Graphs. We will see when a graph is perfect and which are families

More information

Characterization of Boolean Topological Logics

Characterization of Boolean Topological Logics Characterization of Boolean Topological Logics Short Form: Boolean Topological Logics Anthony R. Fressola Denison University Granville, OH 43023 University of Illinois Urbana-Champaign, IL USA 61801-61802

More information

New Constructions of Non-Adaptive and Error-Tolerance Pooling Designs

New Constructions of Non-Adaptive and Error-Tolerance Pooling Designs New Constructions of Non-Adaptive and Error-Tolerance Pooling Designs Hung Q Ngo Ding-Zhu Du Abstract We propose two new classes of non-adaptive pooling designs The first one is guaranteed to be -error-detecting

More information

THREE LECTURES ON BASIC TOPOLOGY. 1. Basic notions.

THREE LECTURES ON BASIC TOPOLOGY. 1. Basic notions. THREE LECTURES ON BASIC TOPOLOGY PHILIP FOTH 1. Basic notions. Let X be a set. To make a topological space out of X, one must specify a collection T of subsets of X, which are said to be open subsets of

More information

Scan Scheduling Specification and Analysis

Scan Scheduling Specification and Analysis Scan Scheduling Specification and Analysis Bruno Dutertre System Design Laboratory SRI International Menlo Park, CA 94025 May 24, 2000 This work was partially funded by DARPA/AFRL under BAE System subcontract

More information

Bound Consistency for Binary Length-Lex Set Constraints

Bound Consistency for Binary Length-Lex Set Constraints Bound Consistency for Binary Length-Lex Set Constraints Pascal Van Hentenryck and Justin Yip Brown University, Box 1910 Carmen Gervet Boston University, Providence, RI 02912 808 Commonwealth Av. Boston,

More information

Lecture 2 September 3

Lecture 2 September 3 EE 381V: Large Scale Optimization Fall 2012 Lecture 2 September 3 Lecturer: Caramanis & Sanghavi Scribe: Hongbo Si, Qiaoyang Ye 2.1 Overview of the last Lecture The focus of the last lecture was to give

More information

Approximation Algorithms

Approximation Algorithms Approximation Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A 4 credit unit course Part of Theoretical Computer Science courses at the Laboratory of Mathematics There will be 4 hours

More information

A Document-centered Approach to a Natural Language Music Search Engine

A Document-centered Approach to a Natural Language Music Search Engine A Document-centered Approach to a Natural Language Music Search Engine Peter Knees, Tim Pohle, Markus Schedl, Dominik Schnitzer, and Klaus Seyerlehner Dept. of Computational Perception, Johannes Kepler

More information

arxiv: v1 [cs.ma] 8 May 2018

arxiv: v1 [cs.ma] 8 May 2018 Ordinal Approximation for Social Choice, Matching, and Facility Location Problems given Candidate Positions Elliot Anshelevich and Wennan Zhu arxiv:1805.03103v1 [cs.ma] 8 May 2018 May 9, 2018 Abstract

More information

Efficient Case Based Feature Construction

Efficient Case Based Feature Construction Efficient Case Based Feature Construction Ingo Mierswa and Michael Wurst Artificial Intelligence Unit,Department of Computer Science, University of Dortmund, Germany {mierswa, wurst}@ls8.cs.uni-dortmund.de

More information

Clustering. (Part 2)

Clustering. (Part 2) Clustering (Part 2) 1 k-means clustering 2 General Observations on k-means clustering In essence, k-means clustering aims at minimizing cluster variance. It is typically used in Euclidean spaces and works

More information

General properties of staircase and convex dual feasible functions

General properties of staircase and convex dual feasible functions General properties of staircase and convex dual feasible functions JÜRGEN RIETZ, CLÁUDIO ALVES, J. M. VALÉRIO de CARVALHO Centro de Investigação Algoritmi da Universidade do Minho, Escola de Engenharia

More information

Recitation 4: Elimination algorithm, reconstituted graph, triangulation

Recitation 4: Elimination algorithm, reconstituted graph, triangulation Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Recitation 4: Elimination algorithm, reconstituted graph, triangulation

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Rigidity, connectivity and graph decompositions

Rigidity, connectivity and graph decompositions First Prev Next Last Rigidity, connectivity and graph decompositions Brigitte Servatius Herman Servatius Worcester Polytechnic Institute Page 1 of 100 First Prev Next Last Page 2 of 100 We say that a framework

More information

A Model of Machine Learning Based on User Preference of Attributes

A Model of Machine Learning Based on User Preference of Attributes 1 A Model of Machine Learning Based on User Preference of Attributes Yiyu Yao 1, Yan Zhao 1, Jue Wang 2 and Suqing Han 2 1 Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

LP-Modelling. dr.ir. C.A.J. Hurkens Technische Universiteit Eindhoven. January 30, 2008

LP-Modelling. dr.ir. C.A.J. Hurkens Technische Universiteit Eindhoven. January 30, 2008 LP-Modelling dr.ir. C.A.J. Hurkens Technische Universiteit Eindhoven January 30, 2008 1 Linear and Integer Programming After a brief check with the backgrounds of the participants it seems that the following

More information

Crossing Families. Abstract

Crossing Families. Abstract Crossing Families Boris Aronov 1, Paul Erdős 2, Wayne Goddard 3, Daniel J. Kleitman 3, Michael Klugerman 3, János Pach 2,4, Leonard J. Schulman 3 Abstract Given a set of points in the plane, a crossing

More information

This article was originally published in a journal published by Elsevier, and the attached copy is provided by Elsevier for the author s benefit and for the benefit of the author s institution, for non-commercial

More information

Höllische Programmiersprachen Hauptseminar im Wintersemester 2014/2015 Determinism and reliability in the context of parallel programming

Höllische Programmiersprachen Hauptseminar im Wintersemester 2014/2015 Determinism and reliability in the context of parallel programming Höllische Programmiersprachen Hauptseminar im Wintersemester 2014/2015 Determinism and reliability in the context of parallel programming Raphael Arias Technische Universität München 19.1.2015 Abstract

More information

Recognizing Interval Bigraphs by Forbidden Patterns

Recognizing Interval Bigraphs by Forbidden Patterns Recognizing Interval Bigraphs by Forbidden Patterns Arash Rafiey Simon Fraser University, Vancouver, Canada, and Indiana State University, IN, USA arashr@sfu.ca, arash.rafiey@indstate.edu Abstract Let

More information

Complexity and approximation of satisfactory partition problems

Complexity and approximation of satisfactory partition problems Complexity and approximation of satisfactory partition problems Cristina Bazgan, Zsolt Tuza, and Daniel Vanderpooten LAMSADE, Université Paris-Dauphine, France {bazgan,vdp}@lamsade.dauphine.fr Computer

More information

9.1 Cook-Levin Theorem

9.1 Cook-Levin Theorem CS787: Advanced Algorithms Scribe: Shijin Kong and David Malec Lecturer: Shuchi Chawla Topic: NP-Completeness, Approximation Algorithms Date: 10/1/2007 As we ve already seen in the preceding lecture, two

More information

On Soft Topological Linear Spaces

On Soft Topological Linear Spaces Republic of Iraq Ministry of Higher Education and Scientific Research University of AL-Qadisiyah College of Computer Science and Formation Technology Department of Mathematics On Soft Topological Linear

More information

Comparison of Recommender System Algorithms focusing on the New-Item and User-Bias Problem

Comparison of Recommender System Algorithms focusing on the New-Item and User-Bias Problem Comparison of Recommender System Algorithms focusing on the New-Item and User-Bias Problem Stefan Hauger 1, Karen H. L. Tso 2, and Lars Schmidt-Thieme 2 1 Department of Computer Science, University of

More information

Towards a Memory-Efficient Knapsack DP Algorithm

Towards a Memory-Efficient Knapsack DP Algorithm Towards a Memory-Efficient Knapsack DP Algorithm Sanjay Rajopadhye The 0/1 knapsack problem (0/1KP) is a classic problem that arises in computer science. The Wikipedia entry http://en.wikipedia.org/wiki/knapsack_problem

More information

Improved algorithms for constructing fault-tolerant spanners

Improved algorithms for constructing fault-tolerant spanners Improved algorithms for constructing fault-tolerant spanners Christos Levcopoulos Giri Narasimhan Michiel Smid December 8, 2000 Abstract Let S be a set of n points in a metric space, and k a positive integer.

More information

Exact Algorithms Lecture 7: FPT Hardness and the ETH

Exact Algorithms Lecture 7: FPT Hardness and the ETH Exact Algorithms Lecture 7: FPT Hardness and the ETH February 12, 2016 Lecturer: Michael Lampis 1 Reminder: FPT algorithms Definition 1. A parameterized problem is a function from (χ, k) {0, 1} N to {0,

More information

AXIOMS FOR THE INTEGERS

AXIOMS FOR THE INTEGERS AXIOMS FOR THE INTEGERS BRIAN OSSERMAN We describe the set of axioms for the integers which we will use in the class. The axioms are almost the same as what is presented in Appendix A of the textbook,

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

Framework for Design of Dynamic Programming Algorithms

Framework for Design of Dynamic Programming Algorithms CSE 441T/541T Advanced Algorithms September 22, 2010 Framework for Design of Dynamic Programming Algorithms Dynamic programming algorithms for combinatorial optimization generalize the strategy we studied

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

1. Lecture notes on bipartite matching February 4th,

1. Lecture notes on bipartite matching February 4th, 1. Lecture notes on bipartite matching February 4th, 2015 6 1.1.1 Hall s Theorem Hall s theorem gives a necessary and sufficient condition for a bipartite graph to have a matching which saturates (or matches)

More information

Simplicial Complexes: Second Lecture

Simplicial Complexes: Second Lecture Simplicial Complexes: Second Lecture 4 Nov, 2010 1 Overview Today we have two main goals: Prove that every continuous map between triangulable spaces can be approximated by a simplicial map. To do this,

More information

Division of the Humanities and Social Sciences. Convex Analysis and Economic Theory Winter Separation theorems

Division of the Humanities and Social Sciences. Convex Analysis and Economic Theory Winter Separation theorems Division of the Humanities and Social Sciences Ec 181 KC Border Convex Analysis and Economic Theory Winter 2018 Topic 8: Separation theorems 8.1 Hyperplanes and half spaces Recall that a hyperplane in

More information

A General Greedy Approximation Algorithm with Applications

A General Greedy Approximation Algorithm with Applications A General Greedy Approximation Algorithm with Applications Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, NY 10598 tzhang@watson.ibm.com Abstract Greedy approximation algorithms have been

More information

Graphs connected with block ciphers

Graphs connected with block ciphers Graphs connected with block ciphers OTOKAR GROŠEK Slovak University of Technology Department of Applied Informatics Ilkovičova 3 812 19 Bratislava SLOVAKIA PAVOL ZAJAC Slovak University of Technology Department

More information

On the Computational Complexity of Nash Equilibria for (0, 1) Bimatrix Games

On the Computational Complexity of Nash Equilibria for (0, 1) Bimatrix Games On the Computational Complexity of Nash Equilibria for (0, 1) Bimatrix Games Bruno Codenotti Daniel Štefankovič Abstract The computational complexity of finding a Nash equilibrium in a nonzero sum bimatrix

More information

A CSP Search Algorithm with Reduced Branching Factor

A CSP Search Algorithm with Reduced Branching Factor A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il

More information

ON CONSISTENCY CHECKING OF SPATIAL RELATIONSHIPS IN CONTENT-BASED IMAGE DATABASE SYSTEMS

ON CONSISTENCY CHECKING OF SPATIAL RELATIONSHIPS IN CONTENT-BASED IMAGE DATABASE SYSTEMS COMMUNICATIONS IN INFORMATION AND SYSTEMS c 2005 International Press Vol. 5, No. 3, pp. 341-366, 2005 004 ON CONSISTENCY CHECKING OF SPATIAL RELATIONSHIPS IN CONTENT-BASED IMAGE DATABASE SYSTEMS QING-LONG

More information

Paths, Flowers and Vertex Cover

Paths, Flowers and Vertex Cover Paths, Flowers and Vertex Cover Venkatesh Raman M. S. Ramanujan Saket Saurabh Abstract It is well known that in a bipartite (and more generally in a König) graph, the size of the minimum vertex cover is

More information

Computational problems. Lecture 2: Combinatorial search and optimisation problems. Computational problems. Examples. Example

Computational problems. Lecture 2: Combinatorial search and optimisation problems. Computational problems. Examples. Example Lecture 2: Combinatorial search and optimisation problems Different types of computational problems Examples of computational problems Relationships between problems Computational properties of different

More information

3 Fractional Ramsey Numbers

3 Fractional Ramsey Numbers 27 3 Fractional Ramsey Numbers Since the definition of Ramsey numbers makes use of the clique number of graphs, we may define fractional Ramsey numbers simply by substituting fractional clique number into

More information

Polynomial-Time Approximation Algorithms

Polynomial-Time Approximation Algorithms 6.854 Advanced Algorithms Lecture 20: 10/27/2006 Lecturer: David Karger Scribes: Matt Doherty, John Nham, Sergiy Sidenko, David Schultz Polynomial-Time Approximation Algorithms NP-hard problems are a vast

More information

Abstract. A graph G is perfect if for every induced subgraph H of G, the chromatic number of H is equal to the size of the largest clique of H.

Abstract. A graph G is perfect if for every induced subgraph H of G, the chromatic number of H is equal to the size of the largest clique of H. Abstract We discuss a class of graphs called perfect graphs. After defining them and getting intuition with a few simple examples (and one less simple example), we present a proof of the Weak Perfect Graph

More information

c 2004 Society for Industrial and Applied Mathematics

c 2004 Society for Industrial and Applied Mathematics SIAM J. MATRIX ANAL. APPL. Vol. 26, No. 2, pp. 390 399 c 2004 Society for Industrial and Applied Mathematics HERMITIAN MATRICES, EIGENVALUE MULTIPLICITIES, AND EIGENVECTOR COMPONENTS CHARLES R. JOHNSON

More information