Ranking Clustered Data with Pairwise Comparisons

Alisa Maas

1. INTRODUCTION

1.1 Background

Machine learning often relies heavily on being able to rank the relative fitness of instances in a large set of data, but in many settings the only reasonable way to acquire information about the ranking is by making pairwise comparisons between instances. Directly comparing two data points often requires human intervention, which can be expensive in time and other resources. Work by Jamieson and Nowak [3] mitigates this problem by providing an algorithm that minimizes the number of pairwise comparisons needed to rank a data set, assuming the set follows a particular structure. In particular, they assume that the set can be embedded in a Euclidean space and that each point's fitness is inversely proportional to its distance from an unknown reference point. Their algorithm achieves a full ranking of the sample space with Θ(d log n) comparisons in the average case, where d is the dimensionality of the Euclidean space and n is the number of samples. This is a nontrivial improvement over general-purpose comparison-based sorting algorithms, which require Ω(n log n) pairwise comparisons even in the average case. The work also presents a version of the algorithm for the noisy case, where each query to the comparison oracle has some probability of returning an incorrect result (and keeps returning the same result on further queries). This version returns a full ranking of some subset of the samples that is consistent with all query responses, and uses Θ(d log^2 n) queries in the average case.

1.2 Contribution

We generalize Jamieson and Nowak's algorithm to data sets whose rankings follow a more general form. Specifically, we expect the data to be partitioned into k clusters, where each cluster has its own unknown reference point and the distance from each data point to its corresponding reference point determines its fitness. We demonstrate that this extended algorithm achieves Θ(kd log n + n log k) comparisons in the noiseless case, and that it performs robustly in both the noiseless and noisy cases, on points drawn uniformly from a hypercube and on real-world product-review data, respectively.

2. APPROACH

Since we expect clustered data, our input consists not just of a set of points but also of a mapping from each point to its cluster. As in Jamieson and Nowak's algorithm, we do not assume the reference points are provided with the rest of the input. Our algorithm is fully general: when run with only a single cluster, it performs comparably to Jamieson and Nowak's algorithm. In contrast, Jamieson and Nowak's algorithm relies on the assumption that all points are ranked by their distance to a single reference point, and it does not perform correctly when samples are ranked by different reference points.

2.1 Noiseless Case

In the noiseless case, we sort each set of clustered samples independently, using a modified version of Jamieson and Nowak's algorithm (we worked entirely in the primal formulation, while they work in the dual). Jamieson and Nowak use linear programming to impute the outcomes of pairwise comparisons that have not yet been performed, given the additional information that all samples are ranked by their distance to an unknown reference point. A direct comparison is only ever made when a pair's comparison data is ambiguous. Comparison data between x and y is ambiguous if the reference point could be closer to x or could be closer to y, taking into account all constraints on its location implied by the partial ordering obtained so far. This result is closely tied to the assumption that all points are ranked by their distance to a single reference point, so it is clear that their results do not generalize well when samples are ranked relative to different reference points.
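To illustrate the ambiguity test concretely, the sketch below checks whether both orderings of a pair remain feasible under the halfspace constraints implied by previously resolved comparisons. This is only a rough illustration under our assumptions: points are numpy vectors, strict inequalities are relaxed to non-strict ones, and the use of scipy.optimize.linprog for the feasibility check is our choice here rather than a detail of Jamieson and Nowak's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def halfspace(a, b):
    """Constraint row encoding ||r - a||^2 <= ||r - b||^2, i.e. the reference
    point r is at least as close to a as to b:  2 (b - a) . r <= ||b||^2 - ||a||^2."""
    return 2.0 * (b - a), float(b @ b - a @ a)

def is_ambiguous(x, y, known_pairs):
    """A pair (x, y) is ambiguous if, given the halfspace constraints implied
    by the comparisons already resolved, the reference point could still be
    closer to either one.  known_pairs is a list of (closer, farther) pairs."""
    rows, rhs = zip(*(halfspace(a, b) for a, b in known_pairs)) if known_pairs else ((), ())
    A, b = list(rows), list(rhs)

    def feasible(extra_row, extra_rhs):
        # Feasibility check: minimize the zero objective over the polyhedron.
        res = linprog(c=np.zeros(len(x)),
                      A_ub=np.array(A + [extra_row]),
                      b_ub=np.array(b + [extra_rhs]),
                      bounds=[(None, None)] * len(x))
        return res.success

    r1, h1 = halfspace(x, y)   # could the reference point still be closer to x?
    r2, h2 = halfspace(y, x)   # could it still be closer to y?
    return feasible(r1, h1) and feasible(r2, h2)
```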
Jamieson and Nowak use the number of possible orderings of the data to prove theoretical upper bounds on the number of comparisons they require. Our implementation of Jamieson and Nowak's algorithm can be used in conjunction with any sorting algorithm; we chose to use Python's built-in timsort. After running Jamieson and Nowak's algorithm on each cluster, we have k lists, each internally sorted. We merge these sorted clusters pairwise, in the manner of the merge step of mergesort, incurring additional overhead of Θ(n log k). Thus for the noiseless case, the total cost is Θ(kd log n + n log k). Compare this to the minimal number of comparisons that any sorting algorithm could make on data with this structure. A lower bound can be obtained by considering the minimal number of pairwise comparisons needed to distinguish the maximal number of possible orderings, which occurs when the points are split evenly across the clusters. Thus no sorting algorithm can do better than max(2dk log(n/k), k log k), which is derived by considering sorting each individual cluster with no overhead for merging. As the number of clusters approaches the number of samples, this lower bound approaches Ω(n log n). In the case that k = 1, it is exactly the order of comparisons required by Jamieson and Nowak's algorithm, Θ(d log n).
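As a concrete sketch of the noiseless pipeline described above, the code below sorts each cluster independently and then merges the sorted clusters pairwise. It is a minimal illustration under our assumptions: `compare` stands for the pairwise-comparison oracle, and the per-cluster sort simply falls back to Python's timsort with direct oracle calls rather than the LP-based imputation, so only the overall cluster-then-merge structure is shown.

```python
from functools import cmp_to_key

def rank_clusters(clusters, compare):
    """Noiseless sketch: sort each cluster independently, then merge.

    clusters: list of lists of samples (one list per cluster).
    compare:  oracle taking (x, y), returning -1 if x is fitter, +1 otherwise.
    """
    # Sort each cluster independently.  With the LP-based imputation this is
    # where the Θ(d log n) per-cluster query count would come from; as written,
    # it issues O(m log m) oracle calls for a cluster of size m.
    sorted_clusters = [sorted(c, key=cmp_to_key(compare)) for c in clusters]

    # Merge the k sorted lists pairwise, as in mergesort's merge step,
    # costing Θ(n log k) additional comparisons over log k rounds.
    while len(sorted_clusters) > 1:
        merged = []
        for i in range(0, len(sorted_clusters) - 1, 2):
            merged.append(merge_two(sorted_clusters[i], sorted_clusters[i + 1], compare))
        if len(sorted_clusters) % 2 == 1:
            merged.append(sorted_clusters[-1])
        sorted_clusters = merged
    return sorted_clusters[0] if sorted_clusters else []

def merge_two(a, b, compare):
    """Standard two-way merge driven by the comparison oracle."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if compare(a[i], b[j]) <= 0:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```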

2.2 Noisy Case

In the noisy case, we also sort each set of clustered points using Jamieson and Nowak's algorithm. Here, rather than accepting every pairwise comparison as correct, Jamieson and Nowak assume there may be some noise in the comparisons. They therefore construct a voting set from previously sorted pairs to determine where each element belongs. This is most intuitive to implement with an insertion sort, though other sorting algorithms could be used. The size of the voting set is a parameter, R, which can be tuned based on the expected error in the data set. If there is a plurality decision among the voting set, we abide by it. Otherwise, we throw out the element we are trying to insert, as its data is inconsistent with the data obtained earlier. If we throw out a sample, we never use it in a voting set, and it will not be part of the final ranking. If we cannot construct a voting set of size R for a given sample, we also throw it out. This means our output is neither guaranteed to be perfectly sorted nor guaranteed to include all of the original elements, but in practice we find that nearly all elements are mostly sorted.

Once we have the sorted clusters, we must merge them. Unlike before, however, we cannot trust each individual comparison within the merge, so we construct another voting set. If we are merging elements x and y from clusters A and B, and comparing x and y yields the result x < y, we compare x to a voting set of neighbors of y in B, where y is ranked lower than each element in the voting set. Symmetrically, if we find x > y, we construct a voting set for y from A. Rather than throw out elements whose voting sets do not reach a plurality decision, we randomly select one of x, y to append to the end of the merged list. In practice, we find that merging in this manner does not substantially increase the number of misordered elements beyond the number already present from Jamieson and Nowak's algorithm. We separately tune the size of this merge voting set, which we call R'. As we increase R and R', we find as a general trend that the number of comparisons rises and the number of inversions drops.
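Below is a minimal sketch of this voting-based merge of two already-sorted clusters, under our assumptions: `noisy_compare` is a hypothetical oracle that may answer incorrectly, lists are ordered best-first, and the voting set for a candidate is taken to be the next R' elements after the opposing head in its own list (one reading of the neighbor construction above). Ties and empty voting sets fall back to a random choice, as described.

```python
import random

def plurality_says_first(z, voters, noisy_compare):
    """Return True if a plurality of voters says z precedes them, False if a
    plurality says z follows them, or None when there is no plurality."""
    if not voters:
        return None
    wins = sum(1 for v in voters if noisy_compare(z, v) < 0)
    losses = len(voters) - wins
    if wins == losses:
        return None
    return wins > losses

def noisy_merge(a, b, noisy_compare, r_prime, rng=None):
    """Merge two best-first sorted lists when individual comparisons may err.

    noisy_compare(x, y) returns -1 if the oracle says x is fitter, +1 otherwise.
    r_prime is the size R' of the merge voting set.
    """
    rng = rng or random.Random(0)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        x, y = a[i], b[j]
        if noisy_compare(x, y) < 0:
            # Oracle says x < y: confirm x against a voting set of neighbors of y in b.
            take_x = plurality_says_first(x, b[j + 1:j + 1 + r_prime], noisy_compare)
        else:
            # Symmetric case: confirm y against neighbors of x in a.
            y_first = plurality_says_first(y, a[i + 1:i + 1 + r_prime], noisy_compare)
            take_x = None if y_first is None else not y_first
        if take_x is None:
            # No plurality decision: append one of the two candidates at random.
            take_x = rng.random() < 0.5
        if take_x:
            out.append(x); i += 1
        else:
            out.append(y); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```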
3. EXPERIMENTAL SETUP

3.1 Noiseless Case

We present noiseless experimental data obtained by performing experiments similar to those in [3]. As Jamieson and Nowak do, we use points on the unit hypercube, with d = 10, 20, ..., 100. To simulate clustered data, we pick k = 1, 2, 4 reference points uniformly at random and assign each point to its closest reference point. Note that our algorithm is fully general and could support randomly clustered points as well. We plot the number of queries required to obtain a completely sorted set.

3.2 Noisy Case

3.2.1 Data Gathering

For the noisy case, we found that the data set used by Jamieson and Nowak is no longer available, so our results may not be directly comparable to theirs. To consider the case where k > 1, we needed a data set that was already clustered (possibly by machine) and ranked by humans. We elected to use Amazon reviews of several books. Specifically, we downloaded all the reviews labeled as positive and critical (as clustered by Amazon) for The Fault in Our Stars, A Game of Thrones (A Song of Ice and Fire), and Life of Pi. We took the ranking of each review to be the proportion of readers who found the review helpful, as reported by Amazon. After some experimentation we found that the positive clusters were not useful for our purposes: positive reviews elicited fewer votes, so each positive cluster was much noisier. Our motivation for using critical reviews across different products is to present a user in the market for a particular type of product with critical reviews ordered from most to least helpful, enabling a more informed decision.

We adopted a bag-of-words representation for each review, removing common stop words from consideration, and we excluded reviews with fewer than 20 votes. Below this threshold, we believe the data is much noisier and the helpfulness proportion is a poor representation of the ranking. Each cluster contained approximately 30 reviews, and all the review sets had similar distributions of helpfulness proportions. This gives us a Euclidean representation of our data and a hypothesis that its bag-of-words representation relates to its ranking, but the dimensionality far exceeds the number of samples: the bag-of-words representation has roughly 4000 dimensions, while we wanted to consider approximately 30 samples per cluster. Recall that the complexity of our algorithm is Θ(kd log n + n log k); with d >> n it is clearly infeasible. We therefore used multidimensional scaling [1] to reduce the dimensionality to 10 and to 15, and we compare the results in each case. Generally, d = 10 used fewer comparisons, which is unsurprising given the complexity of our algorithm, but also produced more out-of-order points. Thus the dimensionality of the bag-of-words representation appears to be a trade-off between the number of comparisons and the accuracy of the ranking.
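The preprocessing just described (bag of words with stop-word removal, a minimum-vote filter, then multidimensional scaling down to d = 10 or 15) can be sketched roughly as follows. The vote threshold comes from the text, but the function name, argument layout, and use of scikit-learn's CountVectorizer and MDS are our assumptions here, not a description of the original implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import MDS

def embed_reviews(reviews, helpful_votes, total_votes, d=10, min_votes=20):
    """Sketch: turn raw review text into low-dimensional points plus a ranking score.

    reviews:       list of review strings for one cluster.
    helpful_votes: number of "helpful" votes per review.
    total_votes:   total votes per review (used for the >= 20 vote filter).
    Returns (points, helpfulness) where points is an (m, d) array.
    """
    # Drop reviews with too few votes; below this threshold the helpfulness
    # proportion is a poor proxy for the true ranking.
    keep = [i for i, t in enumerate(total_votes) if t >= min_votes]
    texts = [reviews[i] for i in keep]
    helpfulness = np.array([helpful_votes[i] / total_votes[i] for i in keep])

    # Bag of words with common English stop words removed
    # (dimension is roughly the vocabulary size).
    bow = CountVectorizer(stop_words="english").fit_transform(texts).toarray()

    # Multidimensional scaling down to d dimensions, since d >> n makes the
    # ranking algorithm infeasible on the raw representation.
    points = MDS(n_components=d, dissimilarity="euclidean", random_state=0).fit_transform(bow)
    return points, helpfulness
```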

3.2.2 Sources of Noise

There are several sources of noise in our data set. First, it is possible that the bag-of-words representation we used does not correspond to the actual helpfulness of a review in the way we expect. There may not be a single theoretically perfect review for a product, and even if there is, the utility of each review may not be perfectly represented by its distance from that review. Humans have different tastes, and depending on which people voted on the reviews, the ranking gathered by Amazon may not generalize well; we consider this unlikely, however, because of the high number of votes we required per review. It is also possible that review ranks are systematically voted up or down based not on the merit of the content but on the voters' preferences about the novel; we tried to select book reviews without this problem. Finally, it is possible that Amazon's method of clustering reviews into positive and critical is faulty, in which case some clusters could contain incorrect samples. In practice, our results indicate the definite presence of noise within our data set, though it is difficult to tell its precise source. However, our results are robust in the face of this noise, and performance is within an acceptable margin of error given the noisiness of the data.

3.2.3 Evaluation

To evaluate the noisy case, we count the number of inversions produced by a run of our algorithm over a data set, where an inversion is a pair of elements that is improperly ordered. We calculate the maximum number of inversions possible for a partial ranking of the outputted size, and we take the ratio of observed inversions to the maximum possible as a measure of the sortedness of the list. For each pair of sets of critical reviews, we performed 20 sorts, randomly shuffling the reviews before each sort. We aggregate the number of comparisons in each sort, the proportion of inversions to maximum possible inversions, and the length of each partial order. In general, we found a trade-off between inversion proportion and number of pairwise comparisons: the more comparisons we made, the fewer inversions were produced. We also compared triples of sets of critical reviews as well as pairs; these results are discussed in more detail in the Results section. As expected, both the number of inversions and the number of comparisons are higher when merging three sets of items than when merging two.
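A small sketch of this sortedness metric is below; the O(m^2) pair count is adequate at our sample sizes (a merge-based count would be O(m log m)), and the dictionary of true ranks is an assumption made for the illustration.

```python
def inversion_proportion(output, true_rank):
    """Proportion of inverted pairs in `output`, relative to the maximum possible.

    output:    the (possibly partial) ranking produced by the algorithm.
    true_rank: dict mapping each element to its true rank (lower = fitter).
    """
    m = len(output)
    max_inversions = m * (m - 1) // 2
    if max_inversions == 0:
        return 0.0
    # Count pairs that appear in the wrong relative order.
    inversions = sum(
        1
        for i in range(m)
        for j in range(i + 1, m)
        if true_rank[output[i]] > true_rank[output[j]]
    )
    return inversions / max_inversions
```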
4. RESULTS

4.1 Noiseless Case

In the noiseless case, Figure 1 shows that the number of pairwise comparisons (queries) follows the trend we would expect. With more clusters, the overhead from the merge becomes more significant and approaches the cost of mergesort. This effect is exacerbated at higher dimensionality, which motivates our use of multidimensional scaling in the noisy case, as the original dimensionality of that data would be prohibitively high compared to the number of reviews we obtained.

4.2 Noisy Case

Figure 2 shows robustness data for the noisy case. Here there is no clear baseline number of comparisons to beat, since producing a partial order requires assuming that some comparison data may be erroneous. However, comparing the number of queries actually made to the number of all possible pairwise comparisons, we consistently use under half of all possible pairwise comparisons. With 10-dimensional data, the number of queries is significantly lower overall. This is due to properties of Jamieson and Nowak's algorithm: as the number of dimensions increases, more constraints must be built into the linear program before any data can be imputed, requiring more comparisons before any benefit is obtained. The proportion of inversions observed is consistently under 0.4, which is significantly better than a random ordering. The differences between d = 10 and d = 15 are small, though d = 15 generally has a slightly lower proportion of inversions. The proportion of sorted elements is visibly larger for d = 15, but both d = 10 and d = 15 produce output that is at least 80% sorted, much better than randomly permuting the data.

We also attempted to tune good values of R and R' on our data set. Using The Fault in Our Stars and A Game of Thrones, we found optimal values around R = 7, R' = 5. These appear to provide a good balance between a low proportion of inversions (and thus a highly sorted output) and a small number of comparisons. For these values of R and R', we had an average inversion proportion of 0.33 and averaged comparisons for d = 15 among 20 trials, and an average inversion proportion of 0.38 and averaged comparisons for d = 10 among 20 trials. Using all three sets of reviews gave an average of % inversions and comparisons over 20 trials, for R = 10, R' = 5. This is clearly more comparisons than with only two sets of reviews, but there is also more data to sort (about 30% more), so the expected number of comparisons for any sort should be higher.

5. CONCLUSION

We provide an extension to Jamieson and Nowak's algorithm that sorts data whose fitness corresponds to distance to one of k reference points. On noiseless data, we obtain a ranking with Θ(kd log n + n log k) comparisons. In the noisy case, we provide ways of detecting and dealing with the noise inherent in most data sets while still sorting most of the data set. On real-world data, our results indicate that our algorithm is competitive with traditional sorting approaches and with Jamieson and Nowak's algorithm, and may offer some relief when pairwise comparisons are expensive.

5.1 Future Work

We would like to extend this work in several concrete ways. First, there may be better approaches to merging the k clusters of data that reduce the overhead of the sort significantly; more sophisticated merging approaches, such as the one provided by Hwang and Lin [2], may further reduce the number of comparisons required. Second, we would like to experiment with different methods of merging the noisy clustered data and compare them to find the best approach. Our current implementation never removes any elements beyond the ones removed by Jamieson and Nowak's algorithm; we may see better results if, instead of randomly choosing which element to add when the voting set cannot decide, we removed one or both of the troublesome elements. Additionally, we might be able to reduce the number of comparisons by taking the individual votes cast by the voting set into account. If the voting set agrees that x > y, then some of its members must belong before x, and we may be able to move them into their proper places in the merged list without further comparisons. Third, it may be possible to compare the optimal values of R and R', the proportion of inversions, and the number of comparisons to estimate the amount of noise in a particular data set: since R and R' are tuned based on the proportion of noise, in theory it may be possible to work backward to determine how noisy the data set is.

6. ACKNOWLEDGMENTS

This work was done in collaboration with Kevin Kowalski.

7. REFERENCES

[1] M. A. A. Cox and T. F. Cox. Multidimensional scaling. In Handbook of Data Visualization. Springer-Verlag.
[2] F. K. Hwang and S. Lin. Optimal merging of 2 elements with n elements. Acta Informatica, 1(2).
[3] Kevin G. Jamieson and Robert D. Nowak. Active ranking using pairwise comparisons. CoRR, 2011.

Figure 1: The number of comparisons requested, by dimension, for different numbers of clusters in the noiseless case. The dashed line at the top represents the baseline number of queries required to sort the sets using mergesort.

Figure 2: Data for the noisy case, showing the proportion of queries made relative to all possible pairwise comparisons, the proportion of inversions relative to the maximum possible, and the proportion of sorted elements.
