Ranking Clustered Data with Pairwise Comparisons

Alisa Maas
ajmaas@cs.wisc.edu

1. INTRODUCTION

1.1 Background

Machine learning often relies heavily on being able to rank the relative fitness of instances in a large set of data, but in many settings the only reasonable way to acquire information about the ranking is through pairwise comparisons between instances. Often, directly comparing two data points requires human intervention, which can be expensive in time and other resources. Work by Jamieson and Nowak [3] mitigates this problem by providing an algorithm that minimizes the number of pairwise comparisons necessary to rank a data set, assuming that the set follows a particular structure. In particular, they assume that the set can be embedded in a Euclidean space and that each point's fitness is inversely proportional to its distance from an unknown reference point. Their algorithm achieves a full ranking of the sample space with Θ(d log n) comparisons in the average case, where d is the dimensionality of the Euclidean space and n is the number of samples. This is a nontrivial improvement over general-purpose comparison-based sorting algorithms, which require Ω(n log n) pairwise comparisons even in the average case. The work also presents a version of the algorithm for the noisy case, where each query to the comparison oracle has some probability of returning an incorrect result (and keeps returning the same result on further queries). This version returns a full ranking of some subset of the samples that is consistent with all query responses, using Θ(d log² n) queries in the average case.

1.2 Contribution

We generalize Jamieson and Nowak's algorithm to data sets whose rankings follow a more general form.
Specifically, we expect the data to be partitioned into k clusters, where each cluster has its own unknown reference point, and the distance from each data point to its corresponding reference point determines its fitness. We demonstrate that this extended algorithm achieves Θ(kd log n + n log k) comparisons in the noiseless case, and additionally that it performs robustly in both the noiseless and noisy cases, on points drawn uniformly from a hypercube and on real-world product-review data, respectively.

2. APPROACH

Since we expect clustered data, our input consists not just of a set of points, but also of a mapping from each point to its cluster. As in Jamieson and Nowak's algorithm, we do not assume the reference points are provided with the rest of the input. Our algorithm is fully general: when run with only a single cluster, it performs comparably to Jamieson and Nowak's algorithm. In contrast, Jamieson and Nowak's algorithm relies on the assumption that all of the points are ranked by their distance to a single reference point, and it is plain to see that their algorithm does not perform correctly when samples are ranked by different reference points.

2.1 Noiseless Case

In the noiseless case, we sort each cluster of samples independently, using a modified version of Jamieson and Nowak's algorithm (we work entirely in the primal formulation, while they work in the dual). Jamieson and Nowak use linear programming to impute the results of pairwise comparisons that have not yet been performed, given the additional information that all the samples are sorted by their relative distance to an unknown reference point. A direct comparison is only made when a pair's comparison data is ambiguous: the comparison between x and y is ambiguous if the reference point could be closer to x, or could be closer to y, taking into account all constraints on its location implied by the incomplete ordering obtained so far.
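To make the ambiguity test concrete: each earlier answer "a is closer to the reference than b" constrains the unknown reference point r to a halfspace, since ||r − a||² ≤ ||r − b||² is linear in r. The paper resolves ambiguity with a linear program; the sketch below substitutes a Monte-Carlo feasibility check (rejection sampling over the unit hypercube), which is our own simplification for illustration, not the authors' implementation.

```python
import numpy as np

def is_ambiguous(x, y, constraints, d, n_samples=20000, seed=0):
    """Each constraint (a, b) records an earlier answer 'a is closer to the
    reference than b'; that is the halfspace 2*(b-a).r <= ||b||^2 - ||a||^2.
    The pair (x, y) is ambiguous if feasible reference points exist on both
    sides of the bisector of x and y.  The paper decides this with a linear
    program; here we approximate by rejection-sampling the unit hypercube."""
    rng = np.random.default_rng(seed)
    r = rng.random((n_samples, d))
    ok = np.ones(n_samples, dtype=bool)
    for a, b in constraints:
        ok &= r @ (2 * (b - a)) <= b @ b - a @ a
    feasible = r[ok]
    if len(feasible) == 0:
        return False  # no surviving candidate reference points to disagree
    x_closer = (np.linalg.norm(feasible - x, axis=1)
                < np.linalg.norm(feasible - y, axis=1))
    return bool(x_closer.any() and (~x_closer).any())
```

With no constraints, any non-degenerate pair is ambiguous; once the answer "x is closer than y" is itself recorded as a constraint, the pair becomes determined and no direct comparison is needed.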
This result is very closely tied to the assumption that all the points are ranked by their distance to a single reference point, hence it is clear that their results do not generalize well if the samples are ranked relative to different reference points. Jamieson and Nowak use the number of possible orderings of the data to prove theoretical upper bounds on the number of comparisons they require. Our implementation of Jamieson and Nowak's algorithm can be used in conjunction with any sorting algorithm, but we chose to use Python's timsort implementation. After running Jamieson and Nowak's algorithm, we have k lists, each internally sorted. We merge these sorted clusters pairwise, in the manner of the merge step of mergesort, incurring additional overhead of Θ(n log k). Thus for the noiseless case, we have a total of Θ(kd log n + n log k) comparisons. Compare this to the minimal number of comparisons that could be made by any sorting algorithm on data with this structure. A lower bound on the minimal number of comparisons can be obtained by considering the minimal number of pairwise comparisons over the maximal number of possible orderings, which occurs when the points are split evenly across the clusters. Thus no sorting algorithm can do better than max(2dk log(n/k), n log k),

which is derived by considering the cost of sorting each individual cluster with no overhead for merging. As the number of clusters approaches the number of samples, the lower bound approaches Ω(n log n). In the case that k = 1, this is exactly the order of comparisons required by Jamieson and Nowak's algorithm, Θ(d log n).

2.2 Noisy Case

In the noisy case, we also sort each cluster using Jamieson and Nowak's algorithm. Here, rather than accepting any pairwise comparison as correct, Jamieson and Nowak assume there may be some noise in the comparisons. Thus, they construct a voting set from previously sorted pairs to determine where each element belongs. This is most intuitive to implement with an insertion sort, though other sorting algorithms could be used. The size of the voting set is a parameter, R, which can be tuned based on the expected error in the data set. If there is a plurality decision among the voting set, we abide by it. Otherwise, we throw out the element we are trying to insert, as its data is inconsistent with earlier data. A thrown-out sample is never used in a voting set and is not part of the final ranking. If we cannot construct a voting set of size R for a given sample, we also throw it out. This means that our output is neither guaranteed to be perfectly sorted nor guaranteed to include all of the original elements, but in practice we find that nearly all elements are mostly sorted. Once we have the sorted clusters, we must merge them. Unlike before, however, we cannot trust each individual comparison within the merge, so we construct another voting set. If we are merging elements x, y from clusters A, B, and comparing x and y yields the result x < y, we compare x to a voting set of neighbors of y in B, where y is ranked lower than each element in the voting set. Symmetrically, if we find x > y, we construct a voting set for y from A.
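The voting-set insertion described above might be realized as follows. This is a minimal sketch: the concrete windowing choices (confirming the first "e ranks before ranked[j]" answer with the R elements from position j onward, and discarding on ties, disagreements, or short windows) are our own reading of the description, not the authors' exact procedure, and all names are hypothetical.

```python
from functools import cmp_to_key

def noisy_insert(ranked, e, compare, R):
    """Insert e into the sorted list `ranked`, confirming the first
    'e ranks before ranked[j]' oracle answer with a size-R voting set.
    Returns False (element discarded) on a tied or failed vote, or when
    no voting set of R already-ranked elements can be formed."""
    for j in range(len(ranked)):
        if not compare(e, ranked[j]):
            continue                      # oracle says e ranks after ranked[j]
        window = ranked[j:j + R]
        if len(window) < R:
            return False                  # cannot form a full voting set
        votes = sum(compare(e, v) for v in window)
        if 2 * votes > R:                 # plurality confirms the answer
            ranked.insert(j, e)
            return True
        return False                      # tie or disagreement: discard e
    ranked.append(e)                      # e ranks after every sorted element
    return True

def noisy_insertion_sort(items, compare, R=3):
    # Bootstrap: sort the first R+1 items trusting raw comparisons.
    ranked = sorted(items[:R + 1],
                    key=cmp_to_key(lambda a, b: -1 if compare(a, b) else 1))
    for e in items[R + 1:]:
        noisy_insert(ranked, e, compare, R)
    return ranked

# With a noiseless oracle the output is a sorted subset of the input;
# elements near the top that cannot assemble a full voting set are dropped.
out = noisy_insertion_sort([4, 9, 1, 6, 0, 8, 2, 7, 5, 3], lambda a, b: a < b)
```

As in the paper, the output is a partial ranking: it is internally consistent with the confirmed votes but need not contain every input element.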
Rather than throw out elements whose voting sets do not reach a plurality decision, we randomly select one of x, y to append to the end of the merged list. In practice, we find that merging in this manner does not substantially increase the number of misordered elements beyond those already present from Jamieson and Nowak's algorithm. We separately tune the size of this second voting set, which we call R′. As we increase R and R′, the general trend is that the number of comparisons rises and the number of inversions drops.

3. EXPERIMENTAL SETUP

3.1 Noiseless Case

We present noiseless experimental data obtained by performing experiments similar to those in [3]. As in Jamieson and Nowak, we use points drawn uniformly from the unit hypercube, with d = 10, 20, ..., 100. To simulate clustered data, we pick k = 1, 2, 4 reference points uniformly at random and cluster each point with its closest reference point. Note that our algorithm is fully general and could support randomly clustering the points as well. We graph the number of queries required to obtain a completely sorted set.

3.2 Noisy Case

3.2.1 Data Gathering

For the noisy case, we found that the data set used by Jamieson and Nowak was no longer available, so our results may not be directly comparable to theirs. To consider the case where k > 1, we needed a data set that was already clustered (possibly by machine) and ranked by humans. We elected to use Amazon reviews of several books. Specifically, we downloaded all the reviews denoted as positive and critical (as clustered by Amazon) for The Fault in Our Stars, A Game of Thrones (A Song of Ice and Fire), and Life of Pi. We took the ranking of each review to be the proportion of readers who found the review helpful, as reported by Amazon. We found after some experimentation that the positive clusters were not useful for our purposes: positive reviews elicited fewer votes, so each positive cluster was much noisier.
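The noiseless setup described above (uniform points, k uniform reference points, nearest-reference clustering) is straightforward to reproduce; a minimal sketch, with all names our own:

```python
import numpy as np

def make_clustered_instance(n, d, k, seed=0):
    """Sample n points and k reference points uniformly from the unit
    hypercube [0,1]^d, assigning each point to its nearest reference point
    (mirroring the noiseless experimental setup described above)."""
    rng = np.random.default_rng(seed)
    points = rng.random((n, d))
    refs = rng.random((k, d))
    # Squared distance from every point to every reference point.
    d2 = ((points[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return points, refs, labels

points, refs, labels = make_clustered_instance(n=200, d=10, k=4)
```

The fitness of each point is then determined by its distance to refs[labels[i]], which the sorting algorithm never sees directly; it only issues pairwise queries.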
Our motivation for using critical reviews across different products would be to present a user in the market for a particular type of product with critical reviews ordered from most to least helpful, enabling a more informed decision. We adopted a bag-of-words representation of each review, removing common stop words from consideration. We also removed reviews that had fewer than 20 votes; below this threshold, we believe the data is much noisier and the helpfulness proportion is a poor representation of the ranking. Each cluster contained approximately 30 reviews, and all the review sets had similar distributions of helpfulness proportions. We now have a Euclidean representation of our data, and a hypothesis that its bag-of-words representation relates to its ranking, but our dimensionality far exceeds our number of samples. Our bag-of-words approach resulted in points of dimension roughly 4000, while we wanted to consider approximately 30 samples per cluster. Recall that the complexity of our algorithm is Θ(kd log n + n log k); clearly, with d ≫ n, our algorithm is infeasible. We used multidimensional scaling [1] to reduce the dimensionality to 10 and 15, and compare the results in each case. Generally, we found that for d = 10 we used fewer comparisons, which is unsurprising given the complexity of our algorithm, but also had more out-of-order points. Thus the dimensionality for the bag-of-words approach appears to be a trade-off between the number of comparisons and the accuracy of the ranking.

3.2.2 Sources of Noise

There are several sources of noise in our data set. Firstly, it is possible that the bag-of-words implementation we used does not closely correspond to the actual helpfulness of a review in the manner that we expect. There may not be a single theoretically perfect review for a product, and even if there is, the utility of each review may not be perfectly represented by its distance from that review.
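The paper cites Cox and Cox [1] for multidimensional scaling but does not specify which variant was used; classical (Torgerson) MDS is one self-contained possibility, sketched below with hypothetical sizes standing in for the review data.

```python
import numpy as np

def classical_mds(X, out_dim):
    """Classical (Torgerson) MDS: double-centre the squared Euclidean
    distance matrix and embed via its top eigenvectors.  For Euclidean
    input this coincides with PCA; it is one possible reading of the
    MDS step described above, not necessarily the variant used."""
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    J = np.eye(n) - np.ones((n, n)) / n                  # centring matrix
    B = -0.5 * J @ D2 @ J                                # Gram matrix
    w, V = np.linalg.eigh(B)                             # ascending eigenvalues
    idx = np.argsort(w)[::-1][:out_dim]                  # keep the top out_dim
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))

rng = np.random.default_rng(0)
X = rng.random((30, 400))   # e.g. ~30 reviews, 400-term bag of words
Y = classical_mds(X, 10)    # 30 points embedded in 10 dimensions
```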
Humans have different tastes, and depending on which people voted on the reviews, the ranking gathered by Amazon may not generalize well. However, we consider this unlikely, due to the high number of votes we required per review. It is also possible that reviews are systematically voted up or down based not on the merit of their content but on the voters' preferences about the novel; we tried to select book reviews without this problem. Finally, it is also possible that Amazon's method of clustering reviews into positive and critical is faulty, in which case some of the clusters could contain incorrect samples. In practice, our results indicate the definite presence of noise in our data set, though it is difficult to tell its precise source. However, our results are robust in the face of this noise, and performance is within an acceptable margin of error given the noisiness of the data.

3.2.3 Evaluation

To evaluate the noisy case, we determine the number of inversions produced by a run of our algorithm over a set of data: the number of pairs of elements that are improperly ordered. We calculate the maximum number of inversions possible for a partial ranking of the outputted size, and take the proportion of inversions found to the maximum possible as a metric for the sortedness of the list. For each pair of sets of critical reviews, we performed 20 sorts, randomly shuffling the reviews before each sort. We aggregate data on the number of comparisons in each sort, the proportion of inversions to maximum possible inversions, and the length of each partial order. In general, we found a trade-off between inversion proportion and number of pairwise comparisons: the more comparisons we made, the fewer inversions were produced. We also compared triples of sets of critical reviews as well as pairs; these results are discussed in more detail in Section 4. As expected, the numbers of inversions and comparisons are higher when merging three sets of items than two.

4. RESULTS

4.1 Noiseless Case

In the noiseless case, we see in Figure 1 that the number of pairwise comparisons (queries) follows the trend we would expect. As the number of clusters grows, the overhead from the merge becomes more significant and approaches the cost of mergesort. This problem is exacerbated at higher dimensionality, which motivates our use of multidimensional scaling in the noisy case: the original dimensionality of the data would be prohibitively high compared to the number of reviews we obtained.
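The inversion metric used in the evaluation can be computed in O(n log n) with a standard mergesort-style counter; a minimal sketch:

```python
def count_inversions(ranking):
    """Count pairs (i, j) with i < j and ranking[i] > ranking[j] via a
    mergesort-style scan, returning the raw count and its proportion of
    the maximum possible n*(n-1)/2 (the sortedness metric above)."""
    def sort(xs):
        if len(xs) <= 1:
            return xs, 0
        mid = len(xs) // 2
        left, li = sort(xs[:mid])
        right, ri = sort(xs[mid:])
        merged, inv, i, j = [], li + ri, 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
                inv += len(left) - i   # every remaining left item is inverted
        merged += left[i:] + right[j:]
        return merged, inv
    _, inv = sort(list(ranking))
    n = len(ranking)
    return inv, inv / (n * (n - 1) / 2) if n > 1 else 0.0
```

For a partial ranking, n here is the length of the output, matching the normalization described above.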
4.2 Noisy Case

In the noisy case, we see robustness data in Figure 2. Here there is no clear baseline number of comparisons to beat, as producing a partial order requires assuming that some comparison data may be erroneous. However, comparing the number of queries actually made to the number of all possible pairwise comparisons, we find that we consistently make under half of all possible pairwise comparisons. With 10-dimensional data, the number of queries is significantly lower overall. This is due to properties of Jamieson and Nowak's algorithm: as the number of dimensions increases, the number of constraints that must be built into the linear program before any data can be imputed increases, requiring more and more comparisons before any benefit is obtained. The proportion of inversions observed is consistently under 0.4, which is significantly better than random sorting. There appear to be only small differences between d = 10 and d = 15, though d = 15 generally has a slightly lower proportion of inversions. The proportion of sorted elements is visibly larger for d = 15, but both d = 10 and d = 15 produce data that is at least 80% sorted, much better than a random permutation. We also attempted to fine-tune good values of R and R′ for our data set. Using The Fault in Our Stars and Game of Thrones, we found optimal values around R = 7, R′ = 5. These appear to provide a good balance between a low proportion of inversions (and thus a highly sorted output) and a small number of comparisons. For these values of R and R′, we had an average inversion proportion of 0.33 and an average of 835.95 comparisons for d = 15 over 20 trials, and an average inversion proportion of 0.38 and an average of 599.9 comparisons for d = 10 over 20 trials. Using all three sets of reviews yielded an average of 36.415% inversions and 1180.7 comparisons over 20 trials, with R = 10, R′ = 5.
This is clearly greater than the number of comparisons used with only two sets of reviews, but there is also more data to sort overall (about 30% more), so the expected number of comparisons under any sort should be higher.

5. CONCLUSION

In conclusion, we provide an extension of Jamieson and Nowak's algorithm that sorts data whose fitness corresponds to distance to one of k reference points. On noiseless data, we obtain Θ(kd log n + n log k) comparisons. In the noisy case, we provide ways of detecting and dealing with the noise inherent in most data sets while still sorting most of the data set. On real-world data, we obtain results indicating that our algorithm is competitive with traditional sorting approaches and with Jamieson and Nowak's algorithm, and may offer some relief when pairwise comparisons are expensive.

5.1 Future Work

We would like to extend this work in several concrete ways. Firstly, there may be better approaches to merging the k clusters of data that reduce the overhead of the sort significantly; more sophisticated merging approaches, such as the one provided by Hwang and Lin [2], may further reduce the number of comparisons required. Secondly, we would like to experiment with different methods of merging the noisy clustered data and compare them to find the best approach. Our current implementation never removes any elements beyond those removed by Jamieson and Nowak's algorithm. We may see better results if, instead of randomly choosing which element to add when the voting set cannot decide, we removed one or both of the troublesome elements. Additionally, we might be able to reduce the number of comparisons by taking the votes cast by the voting set into account: if the voting set agrees that x > y, then some of that set must belong before x. In that case, we may be able to move some of the voting set into its proper place in the merged array without further comparisons.
Thirdly, it may be possible to compare the optimal values of R and R′, the proportion of inversions, and the number of comparisons to estimate the noise in a particular data set. Since R and R′ are tuned based on the proportion of noise in the data set, in theory it may be possible to work backward to determine the amount of noise in the data set.

6. ACKNOWLEDGMENTS

This work was done in collaboration with Kevin Kowalski.

7. REFERENCES

[1] M. A. A. Cox and T. F. Cox. Multidimensional scaling. In Handbook of Data Visualization, pages 315-347. Springer-Verlag, 2008.
[2] F. K. Hwang and S. Lin. Optimal merging of 2 elements with n elements. Acta Informatica, 1(2):145-158, 1971.
[3] Kevin G. Jamieson and Robert D. Nowak. Active ranking using pairwise comparisons. CoRR, abs/1109.3701, 2011.

Figure 1: The number of comparisons requested, by dimension, for different numbers of clusters in the noiseless case. The dashed line at the top represents the baseline number of queries required to sort the sets using mergesort.

Figure 2: Data on the noisy case: the proportion of queries made relative to all possible pairwise comparisons, the proportion of inversions to maximum possible inversions, and the proportion of sorted elements.