Ranking Clustered Data with Pairwise Comparisons


Kevin Kowalski
nargle@cs.wisc.edu

1. INTRODUCTION

Background. Machine learning often relies heavily on being able to rank instances in a large set of data by some measure of relative fitness, but in many settings the only reasonable way to acquire information on the ranking is through making pairwise comparisons between the instances. Oftentimes, directly comparing these instances requires human intervention, which can be expensive in terms of time and other resources. Work by Jamieson and Nowak in [JN11] mitigates this problem by providing an algorithm that minimizes the number of pairwise comparisons necessary to rank a data set, assuming that the set is structured in a certain way. In particular, [JN11] assumes that the set can be embedded in a Euclidean space and that each point's fitness is inversely proportional to its distance from an unknown reference point. The algorithm presented in that paper achieves a full ranking of the sample space with Θ(d log n) comparisons in the average case, where d is the dimensionality of the Euclidean space and n is the number of samples. This is a nontrivial improvement over general-purpose comparison-based sorting algorithms, which require Ω(n log n) pairwise comparisons even in the average case. The work also presents a version of the algorithm that works in a noisy setting, where each query to the comparison oracle has some probability of returning an incorrect result (and keeps returning the same result on further queries). This version returns a full ranking of some subset of the samples that is consistent with all query responses and uses Θ(d log² n) queries in the average case, assuming a constant probability of error.

Contribution. We generalize the algorithm of [JN11] to perform on data sets with rankings that follow a more general form.
Specifically, we expect the data to be partitioned into k clusters, where each cluster has its own unknown reference point, and the distance from each data point to its corresponding reference point determines its fitness. We demonstrate that this extended algorithm achieves Θ(kd log n + n log k) comparisons in the noiseless case, and additionally that it performs robustly in both the noiseless and noisy cases on points drawn uniformly from a hypercube and on real-world product review data, respectively.

2. METHODS

The algorithm we present in this report relies on the algorithms of [JN11] as subroutines in both the noiseless and noisy settings, so we begin with a brief characterization of each. For the remainder of this section, we assume that S = {s_1, ..., s_n} is the set of data points, s_i ≺ s_j denotes the event that s_i is ranked below s_j, S_1 ∪ ··· ∪ S_k is a partitioning of S into k clusters, and d is the dimensionality of the points in S.

Noiseless active ranking with one reference point. In the noiseless case, the procedure works by running any standard comparison-based sorting algorithm¹ on the set of points to be ranked, with the following exception. Whenever a comparison is to be made, the procedure first checks whether the outcome of the comparison can be imputed from the values of the comparisons it has already made. If the value of the comparison cannot be imputed, then the comparison oracle is queried; otherwise the imputed value is used instead. More specifically, the outcome of a comparison s_k vs. s_l is ambiguous if there exists a ranking consistent with the extant comparisons that ranks s_k below s_l, and another that ranks s_l below s_k. Equivalently, a comparison is ambiguous if there exist candidate points of reference that produce each of these rankings. For any point of reference r = (r_1, ..., r_d) and any points s_i, s_j ∈ S, let H_ij be the hyperplane normal to and bisecting the line segment between s_i and s_j.
Then, s_i is ranked above s_j in the ranking induced by r if and only if r lies on the side of H_ij closer to s_i (since H_ij divides the space of all points into a half that is closer to s_i and a half that is closer to s_j). Hence, any candidate reference point r is consistent with the extant comparisons if, for every comparison s_i ≺ s_j whose outcome is known, r lies on the correct side of H_ij. This defines a set of linear constraints on r, and given a comparison s_k vs. s_l, the goal is then to determine whether these constraints force the comparison to take a particular value. We can determine whether it is possible for s_k to be ranked below s_l by adding the linear constraint encoding s_k ≺ s_l to the rest and invoking a standard linear programming algorithm to determine whether a feasible r exists. We can determine whether it is possible to rank s_k above s_l in exactly the same way, so this tells us whether the comparison s_k vs. s_l is ambiguous, as desired.

¹ For our experiment, we used TimSort, a combination of merge sort and insertion sort that Python uses by default.
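The feasibility test just described can be sketched with an off-the-shelf LP solver. This is our own illustration, not the paper's code: the function names are made up, SciPy's `linprog` stands in for "a standard linear programming algorithm", and the strict half-space constraints are relaxed to non-strict ones (harmless for points in general position).

```python
import numpy as np
from scipy.optimize import linprog

def constraint_row(s_i, s_j):
    """Linear constraint on r encoding s_i ≺ s_j, i.e. ||r - s_i|| < ||r - s_j||.
    Expanding both norms gives: 2 (s_j - s_i) · r <= ||s_j||^2 - ||s_i||^2."""
    return 2 * (s_j - s_i), float(s_j @ s_j - s_i @ s_i)

def is_feasible(known, candidate, S):
    """Check whether some reference point r is consistent with all `known`
    comparisons plus `candidate`, each given as an index pair (i, j) meaning
    s_i ≺ s_j.  Feasibility is tested with a zero-objective LP."""
    rows, rhs = [], []
    for i, j in list(known) + [candidate]:
        a, b = constraint_row(S[i], S[j])
        rows.append(a)
        rhs.append(b)
    d = S.shape[1]
    res = linprog(c=np.zeros(d), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * d, method="highs")
    return res.status == 0  # status 0 means a feasible r was found

def ambiguous(known, k, l, S):
    """s_k vs. s_l is ambiguous iff both orderings admit a consistent r."""
    return is_feasible(known, (k, l), S) and is_feasible(known, (l, k), S)
```

With no constraints every comparison is ambiguous; once enough constraints accumulate, outcomes become imputable and no oracle query is needed.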

Noisy active ranking with one reference point. In the noisy setting, queries to the comparison oracle have some probability of returning an erroneous result, and this error is persistent across multiple queries regarding the same comparison. Hence, it becomes impossible in the general case to fully rank a list of samples, so our goal instead becomes to return a full ranking on a reasonably large subset of the original list that is consistent with all the responses we receive from the oracle. Working in this setting necessitates two major changes to the noiseless procedure:

1. The underlying comparison-based sorting algorithm must be insertion sort. In order to create the ranking on a subset, the algorithm must build the subset one element at a time while ensuring that query results are consistent with all elements already in the subset. The structure of insertion sort is the only natural one for this purpose.

2. Suppose that we already have a partial ranking on some subset of the first l − 1 samples, and we want to add s_l to the ranking. If the outcome of a query s_k vs. s_l is imputable, then we take the imputed value as truth, but if the outcome is ambiguous, then instead of querying the oracle and trusting its response, we create a voting set of exactly R samples s_j, where R is a parameter of the algorithm. Each sample votes for the outcome s_k ≺ s_l if, after querying the oracle on s_k vs. s_j and s_j vs. s_l (or imputing the values of these queries if they are imputable), we get that s_k ≺ s_j ≺ s_l, or votes for the opposite outcome if s_l ≺ s_j ≺ s_k. The sample abstains if both s_k ≺ s_j and s_l ≺ s_j, or vice versa. The plurality vote determines the outcome we accept as truth, and in the case of a tie we just directly query the oracle on s_k vs. s_l. In order for a sample s_j to be considered for the voting set, it must be possible for s_j to lie between s_k and s_l (or else the sample would simply abstain).
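As a concrete sketch of the plurality vote in step 2: the function below is our own illustration, with `query(i, j)` standing in for the (possibly noisy, persistent) comparison oracle's answer to "is s_i ≺ s_j?"; imputation of already-determined comparisons is omitted for brevity.

```python
def resolve_by_vote(k, l, voters, query):
    """Decide the outcome of s_k vs. s_l by plurality vote over `voters`.
    `query(i, j)` returns True iff the oracle reports s_i ≺ s_j.
    Returns True if the accepted outcome is s_k ≺ s_l."""
    for_kl = against_kl = 0
    for j in voters:
        kj = query(k, j)   # oracle: s_k ≺ s_j ?
        jl = query(j, l)   # oracle: s_j ≺ s_l ?
        if kj and jl:              # s_k ≺ s_j ≺ s_l: vote for s_k ≺ s_l
            for_kl += 1
        elif not kj and not jl:    # s_l ≺ s_j ≺ s_k: vote for the opposite
            against_kl += 1
        # otherwise s_j does not lie between s_k and s_l: abstain
    if for_kl != against_kl:
        return for_kl > against_kl
    return query(k, l)  # tie: fall back to a direct oracle query
```

With a truthful oracle, any voter lying strictly between the two samples votes for the correct outcome.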
Another way of stating this condition is that the outcome of either s_k vs. s_j or s_j vs. s_l must be ambiguous with respect to all the query results we have accepted as true². If there are fewer than R samples that meet this criterion, then we likely will not be able to obtain enough data to rank s_l accurately, so we delete s_l from the ranking.

Active ranking with multiple reference points. In the setting of multiple reference points, we are given S = S_1 ∪ ··· ∪ S_k as input, where each data point in S_i is ranked according to its distance to the i-th reference point. A high-level description of our algorithm is as follows. Given S, we pass each cluster S_i to the single-reference-point ranking algorithm in sequence, obtaining a ranking (or partial ranking in the noisy case) on the elements of each cluster. Then, we merge the k rankings together with k − 1 invocations of our pairwise merge procedure. These pairwise merges are organized into a binary tree, so that if σ_1, ..., σ_k are the rankings of the clusters, then first we merge σ_1 and σ_2 to make σ_{1,2}, then σ_3 and σ_4 to make σ_{3,4}, and so on³, until σ_{k−1} and σ_k make σ_{k−1,k}. After that we merge σ_{1,2} and σ_{3,4} to make σ_{1,2,3,4}, and this process continues until we get the final ranking σ_{1,...,k}. In the noiseless case, the pairwise merges work exactly like the standard merge procedure from merge sort: in each iteration, we compare the least element of the first list to the least element of the second, then remove the lesser of the two from the corresponding list and add it to the sorted list we are building. In this case, we can prove an upper bound on the expected number of pairwise comparisons the algorithm makes.

² The definition of "ambiguous" in this context is actually underspecified in [JN11], but this interpretation makes the algorithm perform about as well as the results in that paper lead us to expect.
³ If k is odd, then σ_k gets merged with an arbitrary other ranking.

Proposition 1.
Let S = S_1 ∪ ··· ∪ S_k be such that |S_i| = n_i for i ∈ [k], let σ be a ranking on S that is inducible by some k reference points, and let M(σ) denote the number of pairwise comparisons the algorithm makes to produce a full ranking on S. Then, E[M(σ)] = O(kd log n + n log k), where the expectation is over σ drawn from the uniform distribution of rankings inducible by k reference points, d is the dimensionality of each point in S, and n = Σ_i n_i.

Proof. By [JN11], the expected number of comparisons the noiseless single-reference-point algorithm makes on input of size n_i is at most cd log n_i for some constant c. The expected number of comparisons that the multiple-reference-point algorithm takes is then Σ_i cd log n_i plus the expected number of comparisons for the merging procedure. By the concavity of the logarithm (Jensen's inequality),

    Σ_i cd log n_i ≤ ckd log(n/k) = O(kd log n).

Any pairwise merge of lists of size a and b uses at most a + b comparisons, since it takes at most one comparison to add each element to the fully sorted list, and since the pattern of merges forms a binary tree, each sample undergoes at most log₂ k merges. Hence, the entire merge procedure uses at most n log₂ k = O(n log k) comparisons. Adding this to the expected number of comparisons used to sort each cluster gives a total of O(kd log n + n log k), as desired.

In the noisy case, the algorithm takes an additional parameter R′ that serves a purpose analogous to that of R in the single-reference-point algorithm. Since we cannot trust the result of the comparison between the least element of the first list and the least element of the second, after getting the result we create a voting set of size R′. Let A and B be the two lists we are merging, and assume that they are sorted from least to greatest (so in particular, A[0] and B[0] are the least elements of the two lists). Without loss of generality, assume that the initial query returns that A[0] ≺ B[0].
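The binary-tree pattern of pairwise merges described above can be sketched as follows (noiseless case). This is our own sketch: `less` stands in for the comparison oracle, and, as a simplification of footnote 3, an odd leftover ranking is carried up a level rather than merged with an arbitrary partner.

```python
def merge_two(a, b, less):
    """Standard merge of two sorted lists using comparator `less`."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if less(a[i], b[j]):
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def tree_merge(rankings, less):
    """Merge k per-cluster rankings pairwise in a binary tree, so each
    element participates in at most ceil(log2 k) merges."""
    while len(rankings) > 1:
        nxt = [merge_two(rankings[i], rankings[i + 1], less)
               for i in range(0, len(rankings) - 1, 2)]
        if len(rankings) % 2:   # odd k: carry the last ranking up a level
            nxt.append(rankings[-1])
        rankings = nxt
    return rankings[0]
```

Each level of the tree touches every element at most once, which is where the n log k term in Proposition 1 comes from.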
Then, the voting set for the query consists of the elements {B[0], B[1], ..., B[R′ − 1]}, or all the elements of B if B has fewer than R′ elements. For each element B[j] in the voting set, we query the oracle on whether A[0] ≺ B[j], and each positive result counts as a vote for the outcome A[0] ≺ B[0], while each negative result counts as a vote for B[0] ≺ A[0]. The plurality vote indicates the outcome we accept as truth, and in case of a tie the outcome is chosen randomly. The intuition behind this voting procedure is that if B[0] ≺ A[0], then the remaining elements of the voting set will very likely vote for the correct outcome, and if A[0] ≺ B[0], then

one of two things will happen. If there are many (say, more than R′) elements in B that outrank A[0], then the voting set will likely vote overwhelmingly for the correct outcome, but if there are few (say, fewer than R′/2), then the voting set will likely vote for the incorrect outcome. However, in this latter case, if we rank B[0] below A[0], we are unlikely to create many inversions between B[0] and A, since B[0] and A[0] are likely to be close.

3. RESULTS AND INTERPRETATION

We ran experiments to evaluate the quality of our multiple-reference-point sorter in both the noiseless and noisy cases.

Noiseless experiment. For this algorithm we adapted the noiseless experiment of [JN11] to the case of two reference points. In each trial, S was initialized to contain n = 100 points drawn uniformly at random from the hypercube [0, 1]^d, and two reference points were drawn uniformly at random from the same distribution. The partition S_1 was defined to be the subset of S consisting of those points closer to the first reference point than the second, and S_2 was defined to be the remaining points. For each value of d = 10, 20, ..., 100, 20 trials were run, and the mean numbers of queries the algorithm used are plotted in Figure 1. As in [JN11], the number of queries used approaches an asymptote as the dimensionality increases. In the k = 1 case, the algorithm is exactly TimSort except that the values of certain comparisons are imputed wherever possible, so it is impossible for the algorithm to make more queries than TimSort on any given input. As dimensionality increases, values become harder to impute, until imputation becomes completely impossible in the case of d = 100 (since in this case d = n, and by cleverly choosing the reference point one can make any ordering possible). For k = 2 and k = 4, the algorithm still outperforms the baseline of TimSort for small values of d, but does worse when the dimensionality is high.
This worsening is due to the extra overhead incurred by the merge procedure. The overall numbers of queries are well within the bounds predicted by Proposition 1.

Noisy experiment. We would have liked to use the same data set as in [JN11] to evaluate our noisy algorithm, but the Aural Sonar data set appears to have disappeared. Instead, critical reviews for both A Game of Thrones and The Fault in Our Stars were scraped from Amazon and ranked according to helpfulness: for each review, Amazon gives users the option to mark the review as either helpful or unhelpful, so the helpfulness score is defined as the ratio of helpful votes to total votes. All reviews with fewer votes than a certain threshold value were excluded, giving us a total of 33 reviews for each book. After scraping, each review was mapped onto a bag-of-words representation with nltk. To reduce the dimensionality of this representation, the samples were further mapped into [0, 1]^10 and [0, 1]^15 using non-metric multidimensional scaling, a dimensionality reduction technique that preserves the relative distances (in this case, Euclidean distance in the bag-of-words space) between points. A good reference for this technique can be found in [CC08]. Implicitly, we are assuming that helpful reviews for the same book are more similar to each other in word choice than to unhelpful reviews, which means that we might be able to induce something similar to the helpfulness ranking with a reference point in the low-dimensional Euclidean space. This is not obviously a safe assumption to make, but it is borne out by the strength of our results. For each of the d = 10 and d = 15 representations, we ran our multiple-reference-point sorter with R = 11 and R′ = 5 for 20 iterations, where in each iteration the sorter received the samples in a random order.
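The dimensionality-reduction step can be sketched with scikit-learn's non-metric MDS. The feature matrix below is random stand-in data, not the scraped reviews, and the min-max rescaling into the unit cube is our own detail; the paper does not name its MDS implementation.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.random((33, 200))  # placeholder for 33 bag-of-words review vectors

# Non-metric MDS preserves the rank order of pairwise Euclidean distances
# rather than the distances themselves.
mds = MDS(n_components=10, metric=False, n_init=1, random_state=0)
Y = mds.fit_transform(X)

# Rescale each coordinate into [0, 1] so the embedding lies in the unit cube.
Y = (Y - Y.min(axis=0)) / (Y.max(axis=0) - Y.min(axis=0))
```

The same call with `n_components=15` produces the d = 15 representation.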
The number of queries used, the number of inversions between the partial ranking output by the algorithm and the correct ranking on those elements, and the number of elements in each partial order are all plotted in Figure 2, relative to the maximum possible values for each. Though the performance of our algorithm on this noisy data set is not directly comparable to the performance of the single-point algorithm on the Aural Sonar data set, the value for the proportion of inversions present is similar (at approximately 40% for d = 10 and 35% for d = 15), while the proportion of queries used is much higher (at approximately 30% for d = 10 and 40% for d = 15, compared to 15% in [JN11] for d = 2). There are a number of factors that influence the numbers we obtain.

1. Noise in the data set. The Aural Sonar data set used in [JN11] was likely much more amenable to being embedded in a Euclidean space than our ad hoc solution. Previous work in [PPA06] suggests that a faithful 2-dimensional embedding of that data set exists, which allows the algorithm to work with a very small number of comparisons. Our data set, on the other hand, has only intuition backing up its suitability.

2. The difficulty of the problem. The two books we chose have reviews with very similar average helpfulness ratings, so the true ranking on the merged list requires a great deal of interleaving. In this case, there are many more possible orderings for the samples than in the case where all the samples have a single reference point, so many more comparisons would be necessary to tease apart the ranking to a similar degree of accuracy.

There are no values for the proportion of ranked elements in [JN11], so it is difficult to say how well our algorithm does in that respect, but we seem to rank close to all of the elements in both the d = 10 and d = 15 cases.
Between the d = 10 and d = 15 cases, the d = 15 case sorts a greater proportion of the elements with a smaller proportion of inversions, though it requires a substantially greater number of queries to do so. This is unsurprising: embedding the points into a higher-dimensional space allows for a more faithful representation of distances in the original space, which translates to a lower rate of error. On the flip side, it is more difficult to impute values in a higher-dimensional space, so we more often need to make queries to compute the value of a comparison.
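The inversion metric reported in Figure 2 can be computed with a straightforward quadratic-time count; the function below is our own sketch, not the authors' code.

```python
def count_inversions(partial, truth):
    """Count pairwise inversions between a partial ranking and the true
    ranking restricted to the same elements.  `partial` lists elements in
    the order the algorithm output them; `truth` maps each element to its
    true rank.  A pair is inverted when its output order disagrees with
    its true order."""
    inv = 0
    for a in range(len(partial)):
        for b in range(a + 1, len(partial)):
            if truth[partial[a]] > truth[partial[b]]:
                inv += 1
    return inv
```

Dividing by the maximum possible value, m(m − 1)/2 for a partial ranking of m elements, gives the proportion plotted in Figure 2.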

Figure 1: Mean and standard deviation of the number of queries are plotted against the dimensionality of the points with n = 100. The dashed line represents the mean number of queries that TimSort uses on the same data sets. Figure 2: Mean values for various proportions are plotted here. The bars represent the maximum and minimum values of the proportions over 20 trials.

4. REFERENCES

[CC08] M. A. A. Cox and T. F. Cox. Multidimensional scaling. In Handbook of Data Visualization, pages 315-347. Springer-Verlag, 2008.

[JN11] Kevin G. Jamieson and Robert D. Nowak. Active ranking using pairwise comparisons. CoRR, abs/1109.3701, 2011.

[PPA06] S. Philips, J. Pitton, and L. Atlas. Perceptual feature identification for active sonar echoes. In OCEANS 2006, pages 1-6, Sept 2006.